Abstract

We introduce an approach to discovery informatics that uses so-called
knowledge graphs as the essential representation structure. Knowledge graph is an
umbrella term that subsumes various approaches to tractable representation of 
large volumes of loosely structured knowledge in a graph form. It has been 
used primarily in the Web and Linked Open Data contexts, but is applicable 
to any other area dealing with knowledge representation. In the perspective of 
our approach motivated by the challenges of discovery informatics, knowledge 
graphs correspond to hypotheses. We present a framework for formalising
so-called hypothesis virtues within knowledge graphs. The framework is based on
a classic work in philosophy of science, and naturally progresses from mostly
informative foundational notions to actionable specifications of measures
corresponding to particular virtues. These measures can consequently be used to
determine refined sub-sets of knowledge graphs that have large relative potential 
for making discoveries. We validate the proposed framework by experiments in 
literature-based discovery. The experiments have demonstrated the utility of 
our work and its superiority w.r.t. related approaches. 

Keywords: discovery informatics, hypotheses as knowledge graphs, hypothesis 
virtue formalisation, automated knowledge graph construction, evolutionary 
refinement, literature-based discovery 


1. Introduction 

Ever since the dawn of the computer age, researchers have been intrigued by
the possibility of automating the process of discovery E3- Today, the field 
of discovery informatics is getting more relevant than ever before. The large 
amounts of data that are being made openly available for anyone to explore
have an immense potential for making new discoveries, and solutions that would 
enable this are highly sought after [15].

Knowledge graphs are one of the most universal ways of representing
actionable, data-driven knowledge at large scale [S]. They represent knowledge as
relationships (edges) between items of interest (vertices), with the possibility
of adding annotations that represent, for instance, multiple relationship
types (i.e., predicates). Such a representation has many advantages, like
universal applicability and a wealth of well-founded methods for analysing graph
structures. Yet the full potential of knowledge graphs for practical applications
in knowledge discovery is still largely to be explored [5].

The motivation of the presented work is two-fold. Firstly, we want to
propose a general framework for defining features of knowledge graphs that can
determine which parts of the graphs have the highest potential for making
discoveries. We believe that this can facilitate the process of semi-automated knowledge
discovery in domains that have a lot of data available in graph-like format, but
suffer from high redundancy and noise (e.g., the World Wide Web, social networks
or biological pathway databases).

The second motivation is more practical. In our previous work [30], we 
addressed the problem of extracting simple knowledge graphs from biomedical 
texts. The graphs were then used for so-called machine-aided skim reading -
high-level navigation of a specific domain represented by a textual corpus which 
was assumed to facilitate the discovery process. Indeed, even highly experienced 
domain experts were able to discover new and relevant facts using the
prototype system. However, the results also contained some noise and connections
that were correct, but rather obvious and/or uninteresting. This motivated the 
validation experiments presented here, which demonstrate that our framework 
for formalising hypothesis virtues can tackle the problems of noise, redundancy 
and obviousness in knowledge graphs automatically extracted from texts. 

Our approach consists of formalising features applicable to ranking
knowledge graphs (or their partitions) based on their potential for making discoveries.
This can be used for instance for decomposing knowledge graphs into atomic 
subgraphs and consequent construction of a graph that has higher “discovery 
potential” than the original one. The formalisation is based on widely accepted 
hypothesis virtues studied in philosophy of science [36]. Examples of virtues
are refutability or generality - a good scientific hypothesis has to be falsifiable 
and should also provide explanations of phenomena outside of its original scope. 
We present general conditions for each of the virtues and proceed with
defining specific measures that conform to these conditions and can be efficiently
implemented. 

The validation of the approach was performed in the context of literature- 
based discovery m- We extracted knowledge graphs from two de facto standard 
biomedical corpora traditionally used in evaluation of literature-based discovery 
tools. For that we used a very simple and domain-agnostic method that extracts 
statistically significant co-occurrence relationships. We opted for such a
solution to demonstrate the universal applicability of our approach. From these
basic graphs, we constructed refined ones using a genetic algorithm that utilises
the hypothesis virtue measures in the fitness function. The refined graphs were 
analysed according to the evaluation measures used in the literature-based
discovery field and compared to related works. The results of the validation were
positive, as we outperformed the state of the art in most respects. Moreover, 
we discovered relevant relationships that have not been covered by any related 
automated system or manual study. This demonstrates the practical utility of 
our approach. 

Our main contributions are as follows. We have proposed a novel
theoretical framework for extensible definition of measures that can be used to analyse
the discovery potential of knowledge graphs. We have defined specific measures 
applicable especially to refinement of knowledge graphs automatically extracted 
from texts. We have implemented an evolutionary method for refinement of the 
automatically extracted knowledge graphs that is applicable out-of-the-box to 
any domain where English texts are available. We have demonstrated the
practical relevance of the presented research by a successful experimental validation
in the field of literature-based discovery. Last but not least, we have provided 
a data package containing a prototype implementation of our approach, results 
and other data necessary for the replication of our experiments. 

The rest of the article is organised as follows. Section 2 presents the general
framework for formalising the hypothesis virtues in the context of knowledge
graphs. Section 3 then introduces actual measures that follow the general
requirements of the hypothesis virtue formalisations. Our approach is
experimentally validated in Section 4. The section describes the evolutionary refinement
of knowledge graphs extracted from texts and elaborates on the experiments
in literature-based discovery. Related approaches are discussed in Section 5.
Finally, we conclude the article and outline our future work in Section 6.

2. Formalising Hypothesis Virtues 

The foundations of the presented work are built on [55], a classic work in
philosophy of science. The work introduces five virtues of a hypothesis:
conservatism, modesty, simplicity, generality and refutability. These virtues present a
comprehensive compilation of the philosophical treatments of discovery ranging 
from antiquity to modern analytical philosophy, and have been frequently used 
as a reference for determining quality of hypotheses in science. 

According to [36], the virtue of conservatism reflects the fact that a good
hypothesis usually makes rather conservative claims. This is to minimise the
risk of error by reaching too far from the state of the art in one step (even 
though the combination of the particular conservative claims may go very far 
after all, indeed). Modesty is related to conservatism - a hypothesis A is more 
modest than A and B (since A and B entails A), and a more modest hypothesis 
is considered better as it minimises the risk of wrong and/or redundant claims. 
The simplicity virtue posits that a good hypothesis should simplify our view of 
the world by making new claims about it, even though the claims themselves 
may actually be quite complex. The generality virtue is related to the predictive 
power of a hypothesis - the more phenomena (that have perhaps not even been
considered originally) it can explain, the better it is. Finally, refutability means
that a hypothesis should be falsifiable in as obvious a manner as possible. This
is a factor of utmost importance, as discussed in arguably the most influential
work on this topic [33].

In the following, we first define the notions of hypotheses and their claims in
the context of knowledge graphs (Section 2.1) and then continue with formalising
the five virtues (Section 2.2).

2.1. Preliminaries 

First we define a universe - a general knowledge graph within which
particular hypotheses may be defined.

Definition 1. A universe graph U is a tuple $(V_U, E_U, \Lambda_V^U, \Lambda_E^U)$ where $V_U$ is a set
of vertices, $E_U \subseteq V_U \times V_U$ is a set of edges and $\Lambda_V^U, \Lambda_E^U$ are sets of labeling maps
(i.e., morphisms) that associate values with the universe vertices and edges,
respectively.

The labeling maps can, for instance, assign predicate types to edges in
semantic networks, assert vertex types like class or individual in ontology knowledge
graphs, or associate confidence weights with edges of automatically extracted
knowledge graphs. Such a definition can accommodate a broad range of
knowledge graphs with varying levels of semantic complexity, while keeping the basic
structure still compatible with the analysis methods introduced here. The
universe can be either directed or undirected. The experiments presented in this
article deal with an undirected universe and therefore we assume undirected 
graphs in the following unless explicitly stated otherwise. 

A hypothesis in a universe is defined as follows. 

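Definition 2. A hypothesis $H = (V_H, E_H, \Lambda_V^H, \Lambda_E^H)$ is a subgraph of the
universe U such that $V_H \subseteq V_U$, $E_H \subseteq E_U$ and
$\forall \lambda_V^H \in \Lambda_V^H\ \exists \lambda_V^U \in \Lambda_V^U.\ \lambda_V^H \subseteq \lambda_V^U$,
$\forall \lambda_E^H \in \Lambda_E^H\ \exists \lambda_E^U \in \Lambda_E^U.\ \lambda_E^H \subseteq \lambda_E^U$.

The second defining condition of the hypothesis subgraph means that any
specific labeling map employed by a hypothesis has to be subsumed by a map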
defined in the universe. This ensures that the universe is closed w.r.t. possible 
interpretations of the hypotheses existing within it. 

Most of the hypothesis virtues critically depend on what a claim of a
hypothesis is, and therefore we need to define that as well.

Definition 3. A claim of a hypothesis H is a simple (i.e., acyclic) path in the 
graph H. 

Such a definition presents arguably the most universal view on what a particular 
knowledge graph may express. No matter what the actual semantics of the 
relationships in a hypothesis graph are, one can always study what they claim 
at least in terms of connections of vertices by means of edges, i.e., paths (we 
will use the terms path and claim interchangeably in the rest of the article).
This makes our approach applicable to any type of knowledge graph. 

Note that one practical implication of the last definition is that we can 
consider only connected graphs as hypotheses - if there is no path between two 
vertices, no claim is being made about them and they should thus be parts of 
different hypotheses. This is partly related to the open/closed world assumption 
dichotomy. The fact that there is no connection between vertices does not mean no
such connection can exist; it only means nothing is known about it in the context
of the given knowledge graph.

The final preliminary definition concerns all claims possibly made by a
hypothesis.

Definition 4. A claim set of a hypothesis H is the set $\Pi(H)$ of all simple paths
in the corresponding graph. A claim volume of H is the size of its claim set,
i.e., $|\Pi(H)|$.

The claim volume can be very large and is hard to compute even for relatively 
small graphs m- Also, it is not realistic to expect every possible path in a 
knowledge graph to convey a meaningful claim. Therefore in practice, it is 
convenient to restrict the claim set to a more manageable size based on case- 
specific heuristics. However, the maximal possible number of claims is apt as 
a theoretical notion for describing general knowledge graphs without further 
information about their domain and more complex semantics. 
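
To make the growth of the claim volume concrete, the following minimal Python sketch enumerates the claim set of a small undirected graph by brute force. It rests on two conventions that the definitions above do not fix and that are assumed here for illustration only: a claim contains at least one edge, and a path and its reverse count as a single claim. The function names simple_paths and claim_volume are hypothetical.

```python
from itertools import combinations


def simple_paths(vertices, edges):
    """Enumerate simple paths (at least one edge) of an undirected graph.

    Assumption: a path and its reverse are the same claim, so each path
    is stored once in a canonical orientation.
    """
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    found = set()

    def extend(path):
        for nxt in adj[path[-1]]:
            if nxt not in path:          # keep the path simple (acyclic)
                new = path + (nxt,)
                found.add(min(new, new[::-1]))
                extend(new)

    for v in vertices:
        extend((v,))
    return found


def claim_volume(vertices, edges):
    """Size of the claim set under the assumptions stated above."""
    return len(simple_paths(vertices, edges))


if __name__ == "__main__":
    # Claim volume of complete graphs grows very quickly with the vertex count.
    for n in range(2, 7):
        vs = list(range(n))
        print(n, claim_volume(vs, list(combinations(vs, 2))))
```

Under these assumptions a triangle already has 6 claims and the complete graph on four vertices has 30, which illustrates why exhaustive enumeration is only feasible for very small graphs.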

2.2. Formalising the Virtues 

The following five sections present formalisations of the particular
hypothesis virtues using the preliminary notions introduced above. Note that we
provide general guidelines for measuring the virtues first, giving a minimal set of
conditions the measures should satisfy. Detailed examples of specific measures
facilitating literature-based discovery are discussed in Sections 3 and 4.

2.2.1. Conservatism 

Conservative claims should make small steps in a particular direction;
however, the combination of the steps can potentially be quite radical (i.e.,
far-reaching). The conservatism of a path in a hypothesis H can be measured by a
function $f : \Pi(H) \to \mathbb{R}$ that satisfies the following conditions:

1. Assuming a metric $\delta : V_U \times V_U \to \mathbb{R}$ on the vertices in the universe graph,
the function $f$ applied to a path $p = (v_1, v_2, \ldots, v_{|p|})$ is negatively
correlated$^1$ with the $g(\{\delta(v_1, v_2), \delta(v_2, v_3), \ldots, \delta(v_{|p|-1}, v_{|p|})\})$ value, where
$g : 2^{\mathbb{R}} \to \mathbb{R}$ is an aggregation function (e.g., sum, minimum, maximum or
arithmetic mean).


1 Here and in the following, we use broad notions of positive and negative correlation. They 
are meant to generalise the respective notions of proportionality and inverse proportionality to 
possibly non-linear, non-algebraic or statistical relationships that may be specific to particular 
applications. 
2. If radical claims are preferred, then there is an additional requirement for
$f$ being positively correlated with the $\delta(v_1, v_{|p|})$ value.

The conservatism of the whole hypothesis H is computed by aggregating
all path conservatism measures across the $\Pi(H)$ set. The higher the aggregate
value, the larger the conservatism. Due to the complexity of enumerating the
$\Pi(H)$ set, practical conservatism measures can target only a subset of all possible
paths. For instance, a set of shortest paths between all vertices in H w.r.t. the
$\delta$ edge labeling is a viable option as it is comparatively easier to compute and
already satisfies condition 1 if sum is used as the aggregation function.

2.2.2. Modesty 

Let us refer by $H_u$ to the complete graph corresponding to a hypothesis H
(i.e., a graph with an edge between any two vertices in $V_H$). Then the modesty
of H can be defined as

$$\frac{|\Pi(H_u)|}{|\Pi(H)|}.$$

This number reflects the ratio between all possible claims about the entities 
covered by H and the actual number of claims being made. The higher the 
ratio, the larger the modesty (a modest hypothesis minimises the number of 
claims made in relation to the number of claims that can possibly be made). 

As mentioned before, computing the number of all simple paths in a graph is
extremely difficult in general. Therefore in practice, approximations of the
modesty measure are necessary. The approximations, however, should be monotonic
w.r.t. the ideal modesty measure: assuming $f, g$ as the ideal and approximate
modesty measures, respectively, then $g(H) > g(I)$ if and only if $f(H) > f(I)$
for any two hypotheses H, I.

2.2.3. Simplicity 

For this virtue, we use the dual notion of complexity, which has been
extensively studied in the context of graphs [25]. A good hypothesis should simplify
our view of the world despite possibly being locally complex [36]. In order
to formalise this intuition, let us assume the simplicity of a graph is measured
by a function $f : \mathcal{G}_U \to \mathbb{R}$, where $\mathcal{G}_U$ is a set of all graphs conceivable in the
universe U. The function $f$ should satisfy these conditions:

1. Given a hypothesis graph H and a graph complexity measure $c : \mathcal{G}_U \to \mathbb{R}$,
$f$ is positively correlated with the expression

$$\frac{c(U \setminus H)}{c(U)},$$

which reflects the universe simplification rate w.r.t. the hypothesis.$^2$


2 From here on, we use the set-theoretic operators for graphs as a convenience notation 
for the operations applied on the corresponding vertex and edge sets in the actual tuple 
representations of the graphs. The labeling sets of the result are assumed to be $\Lambda_V^U, \Lambda_E^U$,
i.e., the universe ones, unless specified otherwise. 
2. If locally complex hypotheses are preferred, then the function $f$ is also
required to be positively correlated with the value $c(H)$.

Strictly speaking, the rate in the first condition should also be higher than 1 
in order for the hypothesis to make the universe actually simpler, but practical 
applications may relax that requirement and just rank the hypotheses based on 
the measure. 

2.2.4. Generality

Generality can be quantified as a number of explanations (i.e., claims) the
hypothesis H can provide for ‘out-of-scope’ phenomena (i.e., vertices) in the
$U \setminus H$ graph. This can be expressed as

$$g(\{f(u) \mid u \in V_U \setminus V_H\}),$$

where the function $g : 2^{\mathbb{R}} \to \mathbb{R}$ is an aggregation (like sum or arithmetic mean)
over all vertices that are out of the H scope. The function $f : V_U \to \mathbb{R}$ is required
to be positively correlated with the $h(\{|\{v \mid v \in p \wedge v \in V_H\}| \mid p \in \Pi_u(U)\})$ value,
where $h : 2^{\mathbb{R}} \to \mathbb{R}$ is another aggregation function and $\Pi_u(U)$ is a set of all simple
paths in the universe U that start in the vertex $u$.

The generality definition reflects the basic intuition that the higher the
number of H vertices on paths explaining phenomena outside of H, the higher the
generality of H. As the numbers of simple paths can be difficult to compute 
even if limited to paths starting in single nodes, approximations of this measure 
are needed for implementations again. Similarly to the modesty condition, we 
require the approximations to be monotonic w.r.t. the ideal generality measure. 

2.2.5. Refutability 

Refutability can be seen as a quantification of: 1) the ease with which
the claim volume $|\Pi(H)|$ of a particular hypothesis graph H can be reduced;
2) the rate of the reduction. The atomic part of the process of refutation in
the context of knowledge graphs is an invalidation, i.e., removal, of a vertex.
Let us assume a decreasing ranking $R : \mathbb{N} \to V_H$ of the vertices in H based on
the number of simple paths that no longer exist in the graph after the vertex
removal. Then we can define a top-$k$ refutability as

$$\frac{|\Pi(H)|}{|\Pi(H)| + \sum_{i=1}^{k} |\Pi(H/R(i))|},$$

where $H/R(i)$ is a graph resulting from removal of the first vertex in the ranking
$R$ from the graph $H/R(i-1)$. We assert $H/R(0) = H$ by definition. The lower
the number of paths still existing after removing the top vertex according to $R$,
the higher the refutability. The $|\Pi(H)|$ expression is added to the denominator
to avoid potential division by zero, and also to normalise the measure value.

Note that for growing k values, the top-k refutability generally converges 
to similar values for any given set of hypotheses as the measure is relative to 
the total number of paths in the graph. Therefore it is practical to use the 
measure with rather low k values, perhaps even as low as 1 which measures
the rate of refutability in a single vertex removal step. Additionally, the ideal
measure is difficult to compute and approximations are required in practice
again. In particular, one can approximate the $\Pi$ function in the vertex ranking
and refutability definition with one that is monotonic w.r.t. it.
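
The ideal top-k refutability can be sketched for very small graphs as follows. This is only an illustration under stated assumptions, not a reference implementation: the ranking R is computed once on the original graph and the ranked vertices are then removed cumulatively, which is one possible reading of the definition; paths are counted as undirected and must contain at least one edge; all function names are hypothetical.

```python
def count_simple_paths(vertices, edges):
    """Number of undirected simple paths with at least one edge."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def extensions(path):
        # Count every directed simple path that extends `path`.
        return sum(1 + extensions(path + [n])
                   for n in adj[path[-1]] if n not in path)

    # Each undirected path is found once from each of its two ends.
    return sum(extensions([v]) for v in vertices) // 2


def remove_vertex(vertices, edges, v):
    """Graph without vertex v and without the edges incident to it."""
    return ([u for u in vertices if u != v],
            [(a, b) for a, b in edges if v not in (a, b)])


def top_k_refutability(vertices, edges, k=1):
    """Ideal top-k refutability; assumes the graph has at least one edge."""
    vertices, edges = list(vertices), list(edges)
    total = count_simple_paths(vertices, edges)
    # Ranking R: vertices whose removal leaves the fewest paths come first,
    # i.e. decreasing order of the number of destroyed paths.
    ranking = sorted(
        vertices,
        key=lambda v: count_simple_paths(*remove_vertex(vertices, edges, v)))
    denominator = total
    for v in ranking[:k]:
        vertices, edges = remove_vertex(vertices, edges, v)
        denominator += count_simple_paths(vertices, edges)
    return total / denominator


if __name__ == "__main__":
    star = ([0, 1, 2, 3], [(0, 1), (0, 2), (0, 3)])
    chain = ([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3)])
    print(top_k_refutability(*star))   # removing the hub destroys every path
    print(top_k_refutability(*chain))
```

In this sketch a star graph reaches the maximum value 1 for k = 1, because removing its central vertex leaves no path at all, whereas a chain of four vertices scores 6/7.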

3. Specific Virtue Measures 

In this part, we introduce specific instances of hypothesis virtue measures
following the general formalisation presented before. First we give an example
of a universe and a couple of associated structures in Section 3.1. These will
be used for running examples illustrating the measure details in Section 3.2.
Finally, Section 3.3 describes how to use the measures in concert.

3.1. Sample Universe 

The examples throughout this section are all based on an illustrative universe
graph U depicted in Figure 1. The graph features real-valued edge labels in the
(0,1] interval that represent confidence weights of the edges (the higher the label,
the higher the expected degree of association between the corresponding
vertices). These edge labels are used when constructing several auxiliary resources
from the graph. There are no specific types of edges (i.e., predicates) in the
examples since in the experiments reported in this article, we focus only on one
type of relationship based on automatically extracted co-occurrence statements.

Figure 1: Sample universe graph U

First of all, we need to define a metric on the vertices. The most
straightforward option without any background knowledge on the graph is to use its
(weighted) adjacency matrix for constructing characteristic context vectors for
every vertex. The vectors can then be used for computing the actual metric.
The adjacency matrix $A_U$ of U is presented in Table 1. The context vector $\mathbf{x}$
for a vertex $x$ is the row (or column, as the graph is undirected) corresponding
to $x$ in the adjacency matrix $A_U$. Using the context vectors, we can define the
Euclidean distance (i.e., a metric) on the vertices as $\delta(x, y) = \sqrt{\sum_{i=1}^{|V_U|} (x_i - y_i)^2}$,
where $x_i, y_i$ correspond to the $i$-th elements of the $\mathbf{x}, \mathbf{y}$ context vectors,
respectively. The specific distances (to four decimal places) between the universe
vertices are given in Table 2.

Table 1: Weighted adjacency matrix $A_U$

Table 2: Distance matrix $D_U$
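
The construction of context vectors and of the distance matrix can be sketched in a few lines of Python. The edge weights in the usage part are made up for illustration and are not the values of Table 1; the function names context_vectors, euclidean and distance_matrix are likewise hypothetical.

```python
import math


def context_vectors(n, weighted_edges):
    """Rows of the weighted adjacency matrix of an undirected graph.

    `weighted_edges` maps vertex pairs (u, v), with 0 <= u, v < n,
    to confidence weights in the (0, 1] interval.
    """
    matrix = [[0.0] * n for _ in range(n)]
    for (u, v), weight in weighted_edges.items():
        matrix[u][v] = matrix[v][u] = weight
    return matrix


def euclidean(x, y):
    """Euclidean distance between two context vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distance_matrix(vectors):
    """Pairwise distances between all context vectors."""
    return [[euclidean(x, y) for y in vectors] for x in vectors]


if __name__ == "__main__":
    # Hypothetical three-vertex graph, not the universe of Figure 1.
    vectors = context_vectors(3, {(0, 1): 0.5, (1, 2): 1.0})
    for row in distance_matrix(vectors):
        print([round(d, 4) for d in row])
```

Note that two vertices can be close in this metric without being adjacent: what matters is how similar their neighbourhoods are.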

The last auxiliary structure we will need in the following sections (namely
for defining complexity measures) is a clustering of the vertices in U. An example
of a possible clustering is given in Figure 2. It is an overlapping clustering that
groups vertices with mutual distances below 1.5 (in the actual implementation
of our approach, we use a more sophisticated clustering method, as explained in
detail in Section 4.2.1). The clustering contains three clusters $A : \{1, 2\}$, $B : \{1, 4, 5, 6\}$,
$C : \{3, 4, 6\}$. Note that the clustering can either be computed from
the universe graph itself or provided externally (e.g., in the form of an ontology
that defines a taxonomy upon the graph vertices).

Figure 2: Cluster structure of U

3.2. Measure Definitions 

Having introduced the sample universe, we can continue with the specific 
measure definitions which we use later on in the literature-based discovery
experiments.


3.2.1. Conservatism 

Following the conditions provided in Section 2.2.1, we define a specific
instance of the hypothesis graph conservatism measure

$$C(H) = \frac{1}{|\pi_s(H, \delta)|} \sum_{p \in \pi_s(H, \delta)} \frac{\delta(v_1, v_{|p|})}{\sum_{i=1}^{|p|-1} \delta(v_i, v_{i+1})},$$

where $\pi_s(H, \delta)$ is a set of all shortest paths in H w.r.t. the Euclidean distance $\delta$
and $p = (v_1, v_2, \ldots, v_{|p|})$ is a specific shortest path of length $|p|$. In other words,
the $C$ measure is an arithmetic mean of the shortest path conservatism values
where the path conservatism is computed as a fraction of the distance between
the extreme vertices of the path and the path length.$^3$

The measure satisfies condition 1 from Section 2.2.1 as it already focuses
only on paths with minimal aggregate distance between the consecutive vertices
(assuming the sum aggregation). Condition 2 is satisfied as well. For any
path $p$, $\delta(v_1, v_{|p|}) \le \sum_{i=1}^{|p|-1} \delta(v_i, v_{i+1})$. The equality is achieved if and only
if the context vectors of the consecutive vertices represent points that lie in a
straight line, i.e., maximise the distance between the extreme vertices of the
path. Therefore the maximum value 1 of the path conservatism measures is
achieved exactly when the extreme distance is maximal.
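
A minimal Python sketch of the C measure is given below. It assumes that the hypothesis graph is connected and that exactly one shortest path exists between any pair of vertices, so that averaging over vertex pairs equals averaging over shortest paths. The metric is passed in as a function; the point coordinates used as stand-ins for context vectors and the assignment of the three distances to the vertex pairs of the triangle are illustrative assumptions.

```python
import math
from itertools import combinations


def conservatism(vertices, edges, delta):
    """Mean ratio of end-to-end distance to shortest-path length.

    Assumptions: connected graph, one shortest path per vertex pair,
    delta(u, v) > 0 for distinct vertices.
    """
    vs = list(vertices)
    inf = float("inf")
    # Shortest-path lengths w.r.t. delta on the edges (Floyd-Warshall).
    sp = {(u, v): (0.0 if u == v else inf) for u in vs for v in vs}
    for u, v in edges:
        sp[u, v] = sp[v, u] = delta(u, v)
    for k in vs:
        for i in vs:
            for j in vs:
                if sp[i, k] + sp[k, j] < sp[i, j]:
                    sp[i, j] = sp[i, k] + sp[k, j]
    pairs = list(combinations(vs, 2))
    return sum(delta(u, v) / sp[u, v] for u, v in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical 2D points standing in for context vectors.
    straight = {0: (0, 0), 1: (1, 0), 2: (2, 0)}
    bent = {0: (0, 0), 1: (1, 0), 2: (1, 1)}
    chain = [(0, 1), (1, 2)]
    print(conservatism(straight, chain, lambda u, v: math.dist(straight[u], straight[v])))
    print(conservatism(bent, chain, lambda u, v: math.dist(bent[u], bent[v])))

    # A triangle whose edges carry the distances 1.3229, 1.8708 and 1.5.
    tri = {frozenset({1, 2}): 1.3229, frozenset({2, 3}): 1.8708,
           frozenset({1, 3}): 1.5}
    print(conservatism([1, 2, 3], [(1, 2), (2, 3), (1, 3)],
                       lambda u, v: tri[frozenset({u, v})]))
```

The collinear chain and the triangle both reach the maximum value 1, while the bent chain scores lower because its end vertices are closer to each other than the path connecting them is long.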


Example 1. In Figure 3 there are three hypothesis graphs E, F, G that exist
in the universe U described in Section 3.1. The edges are annotated with the
Euclidean distance $\delta$ based on the vertex context vectors (see the examples in
Section 3.1 for details).

The numbers of all shortest paths for the hypothesis graphs E, F, G w.r.t. the
distance $\delta$ are 3, 3, 6, respectively. The conservatism measures of the hypotheses
are

C(E)= l -( 1 - + 1 - + \) = 0M, 

1,1.3229 1.8708 1.5. 1 

3^1.3229 + 1.8708 + 1.5^“ ’ 




1.8028 

1.8028 


1 1.5 

1 + 2J5 


1 


+ x) =0.85, 


3 Note that if there is only one shortest path guaranteed to exist between any pair of vertices
in the H graph, then $|\pi_s(H, \delta)| = \frac{|V_H|(|V_H| - 1)}{2}$ as H is expected to be connected.
Figure 3: Sample hypothesis graphs E, F, G


therefore the hypotheses can be ranked in the

$$F \succ_C G \succ_C E$$

order from the most to the least conservative one.$^4$


3.2.2. Modesty 

As an approximation of the ideal modesty measure presented in
Section 2.2.2, we use inverse density of the hypothesis graph

$$M(H) = \frac{|V_H|(|V_H| - 1)}{2|E_H|}.$$


This function is much easier to compute than the ideal one and is monotonic
w.r.t. it. Since the numerators of both functions are fixed for a given vertex set, we only need to
show that the number of edges is monotonic w.r.t. the number of all simple paths in
a hypothesis graph. This is quite easy - an increase in $|E_H|$ (i.e., adding an edge)
will cause $|\Pi(H)|$ to grow as well since adding an edge will result in at least
one new simple path in H, the edge itself. Conversely, if the set $\Pi(H)$ grows,
it means that edges had to be added to the H graph as it is the only way
the overall number of paths can be increased.


Example 2. The number of edges in the E, F, G graphs from Example 1 is
2, 3, 4, respectively, while the maximum possible number of edges in the
corresponding complete graphs is 3, 3, 6. Therefore the modesty values are

$$M(E) = \frac{3}{2} = 1.5, \quad M(F) = \frac{3}{3} = 1, \quad M(G) = \frac{6}{4} = 1.5$$


4 From here on, we use convenience ordering relations $\succ_X$ for ranking the hypotheses in a
decreasing order according to a specific measure $X$. $E \succ_X F$ if and only if $X(E) > X(F)$.
and the modesty ranking of the hypotheses is

$$E \succ_M F, \quad G \succ_M F.$$
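
The inverse density is straightforward to compute from the vertex and edge counts alone. The short Python sketch below (the function name modesty is illustrative) reproduces the values of Example 2.

```python
def modesty(num_vertices, num_edges):
    """Inverse density of an undirected graph: |V|(|V| - 1) / (2|E|).

    Assumes the graph has at least one edge.
    """
    return num_vertices * (num_vertices - 1) / (2 * num_edges)


if __name__ == "__main__":
    # (number of vertices, number of edges) of E, F, G from Example 2.
    for name, (n, m) in {"E": (3, 2), "F": (3, 3), "G": (4, 4)}.items():
        print(name, modesty(n, m))
```

A complete graph always receives the minimum value 1, and the value grows as edges are removed from a fixed vertex set.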


3.2.3. Simplicity 

As stated in Section 2.2.3, we use the dual notion of complexity for
measuring hypothesis simplicity. For the specific instance of the measure, we employ
Shannon’s entropy that has been frequently used for graph complexity [25]. To
define the entropy, we utilise the clustering of the hypothesis graph vertices
based on their context vectors. Let us assume a vertex labeling $\gamma : V_U \to 2^L$
where $L$ is a set of cluster identifiers. Then we can define a cluster association
probability $p(l, H)$ for a specific cluster $l \in L$ within a hypothesis H as

$$p(l, H) = \frac{|\{v \mid v \in V_H \wedge l \in \gamma(v)\}|}{|V_H|}.$$


It is a probability that a randomly selected vertex from H belongs to a cluster
$l$. If we conceive clusters as higher-level topics the hypothesis graph deals with,
then the probability reflects the distribution of the topics across the graph. The
$p(l, H)$ values can be used for computing the cluster association entropy for a
hypothesis H as

$$E(H) = -\sum_{l \in L} p(l, H) \log_2 p(l, H).$$

It reflects the information value of the hypothesis’ cluster structure - the more 
“unpredictably” distributed clusters, the higher the complexity and also the 
information value. This conforms to an intuitive assumption that hypotheses 
dealing with more topics representatively are more informative, i.e., complex. 

We define two simplicity measures that employ the cluster association
entropy and satisfy the respective conditions introduced in Section 2.2.3:

$$S_1(H) = E(H), \quad S_2(H) = \frac{E(U \setminus H)}{E(U)}.$$


We use both measures in the following to capture different aspects of simplicity 
simultaneously. 


Example 3. In Figure 4 there are the three hypothesis graphs E, F, G and the
universe graph U depicted again, but this time with cluster annotations provided
as vertex labels. The cluster association probabilities for each graph are

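$$p(A, U) = \frac{1}{3}, \quad p(B, U) = \frac{2}{3}, \quad p(C, U) = \frac{1}{2},$$

$$p(A, E) = 0, \quad p(B, E) = 1, \quad p(C, E) = \frac{2}{3},$$

$$p(A, F) = \frac{2}{3}, \quad p(B, F) = \frac{1}{3}, \quad p(C, F) = \frac{1}{3},$$

$$p(A, G) = \frac{1}{4}, \quad p(B, G) = \frac{3}{4}, \quad p(C, G) = \frac{1}{2}.$$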

The entropies corresponding to these probabilities are 

$$E(U) = 1.4183, \quad E(E) = 0.39, \quad E(F) = 2.78, \quad E(G) = 1.3113,$$

$$E(U \setminus E) = 2.78, \quad E(U \setminus F) = 0.39, \quad E(U \setminus G) = 1.5.$$

The hypothesis F is the lowest-ranking no matter which function we use - it has
the lowest entropy and $E(U \setminus F) < E(U)$, therefore it makes the universe more
complex. On the other hand, both E, G increase the simplicity of the universe.
If only local complexity of the acceptable hypotheses is relevant (measure $S_1$),
then the final ranking is

$$G \succ_{S_1} E \succ_{S_1} F$$

since $E(G) > E(E)$. However, if the rate of simplifying the universe is more
important (measure $S_2$), the ranking is

$$E \succ_{S_2} G \succ_{S_2} F$$

as

$$\frac{E(U \setminus E)}{E(U)} = 1.9601 > 1.0576 = \frac{E(U \setminus G)}{E(U)}.$$

3.2.4. Generality

To limit the potentially intractable number of paths in the ideal generality
formula introduced in Section 2.2.4, we apply two approximations in its
specification. Firstly, we focus only on explanations for the universe vertices $V_H^A$
that are immediately adjacent to the measured hypothesis H. The set of edges
that connect these vertices to H can then be defined as
$E_H^A = \{(u, v) \mid (u, v) \in E_U \wedge (u \in V_H^A, v \in V_H \vee u \in V_H, v \in V_H^A)\}$.
The second approximation consists
of focusing only on shortest paths w.r.t. the $\delta$ distance. The specific generality
measure is then defined as

$$G(H) = |\{p \mid p \in \pi_s(U, \delta) \wedge p_1 \in V_H^A \wedge p_2, p_3, \ldots, p_{|p|} \in V_H\}|.$$

The measure corresponds to the number of shortest paths that start in an
adjacent vertex and connect it with vertices in the hypothesis graph H, thus
providing an explanation for it using only H. As the graphs H are assumed to be
connected, the measure can further be simplified as
$G(H) = |E_H^A|(1 + (|V_H| - 1)) = |E_H^A||V_H|$
for graphs where only one shortest path exists between any two
vertices.

The $G(H)$ measure uses sum aggregation as the $g$ function present in the
general definition. The $f$ function that leads to the presented definition of $G(H)$
returns zero for any vertex from the $V_U \setminus V_H$ set that is not immediately adjacent
to H. For other vertices, it returns the number of paths that provide explanation
for them in H. This number is positively correlated with the number of vertices
in $V_H$ as required in the general definition, since the number of paths leading
from a vertex to other vertices in a connected graph H is $|V_H| - 1$ (or more if
multiple shortest paths exist between some vertices).

The restriction to the immediately adjacent vertices leads to a narrowly 
focused generality and helps to reduce combinatorial explosion resulting from 
taking the whole universe graph into account. The shortest path approximation 
is a reasonable limitation as these paths are more likely to be conservative 
explanations. It is not strictly monotonic w.r.t. the ideal generality measure, 
though. If the number of shortest paths increases, then the number of all paths 
naturally has to be higher as well. The other direction is less obvious, and 
conditional. Assuming the number of all paths in a graph has increased, we 
have to show that there also has to be more shortest paths. This is not true in 
general - if edges between distant vertices are added, they may not contribute to 
increasing the number of shortest paths. However, since the measure intuitively 
captures the notion of generality in the context of knowledge graphs and is easy 
to compute, we decided to relax the absolute monotonicity requirement for the 
sake of practicality. 

Example 4. The sets of vertices adjacent to the $E$, $F$, $G$ hypotheses are

$$V_E^A = \{2\}, V_F^A = \{4,6\}, V_G^A = \{1,3\}$$

and the corresponding sets of connecting edges are

$$E_E^A = E_F^A = \{(4,2), (6,2)\}, E_G^A = \{(2,1), (2,3)\}.$$

Since there is only one shortest path between any pair of vertices in our example,
the generality measures are

$$G(E) = |E_E^A| \cdot |V_E| = 2 \cdot 3 = 6, G(F) = |E_F^A| \cdot |V_F| = 2 \cdot 3 = 6,$$

$$G(G) = |E_G^A| \cdot |V_G| = 2 \cdot 4 = 8$$

and the resulting ranking is

$$G \succ_G E, G \succ_G F.$$
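
The simplified form of the generality measure is easy to reproduce in code. The following Python sketch recomputes the values of Example 4. The universe edge list is not given explicitly in this section; it is an assumption inferred from the adjacency sets and shortest paths listed in Examples 4 and 5, and the function names are illustrative only.

```python
# Universe edges inferred from Examples 4 and 5 (an assumption, not source data).
UNIVERSE_EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (2, 6), (4, 5), (4, 6)]

# Vertex sets of the three example hypotheses.
HYPOTHESES = {"E": {4, 5, 6}, "F": {1, 2, 3}, "G": {2, 4, 5, 6}}


def connecting_edges(universe_edges, hypothesis_vertices):
    """Edges with exactly one endpoint inside the hypothesis (the set E_H^A)."""
    return {(u, v) for u, v in universe_edges
            if (u in hypothesis_vertices) != (v in hypothesis_vertices)}


def generality(universe_edges, hypothesis_vertices):
    """Simplified G(H) = |E_H^A| * |V_H|.

    Only valid under the conditions stated in the text: H is connected and
    a single shortest path exists between any two vertices.
    """
    n_connecting = len(connecting_edges(universe_edges, hypothesis_vertices))
    return n_connecting * len(hypothesis_vertices)


scores = {name: generality(UNIVERSE_EDGES, vertices)
          for name, vertices in HYPOTHESES.items()}
print(scores)
```

Under the inferred edge list, the sketch yields 6, 6 and 8 for E, F and G, which matches the example.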


3.2.5. Refutability 

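Using the shortest paths approximation again, we define a specific
refutability measure as

$$R_k(H) = \frac{|\pi_s(H,\delta)|}{|\pi_s(H,\delta)| + |\pi_s(H_{-k},\delta)|},$$

where $H_{-k}$ denotes the graph $H$ after removal of the $k$ top-ranked vertices
according to the ranking $R$.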

Similarly to Section 3.2.4, we consider only the shortest paths instead of all
simple ones, which makes the computation of the measure comparatively easier.
Such an approximation is unfortunately not strictly monotonic, as shown
before. However, we believe that the practicality and intuitiveness of the measure
outweigh the partial monotonicity violation.

For the ranking $R$ of the vertices in the $R_k(H)$ measure computation, we
use the betweenness centrality which is defined as

$$c_B(v,G) = \frac{|\{p \mid p \in \pi_s(G,\delta) \wedge v \in p\}|}{|\pi_s(G,\delta)|},$$

where $v$ is a vertex and $G$ is a graph. In other words, betweenness centrality of a
vertex is the number of shortest paths passing through it divided by the total number
of shortest paths. The ranking $R$ ranks the vertices in a decreasing order based
on their betweenness centrality. Such ranking generally does not mean that
removal of a high-ranking vertex results in a higher number of shortest paths
disappearing when compared to a removal of a lower-ranking vertex - if the
graph remains connected in both cases, the number of shortest paths in it will be
the same after removal of either node. However, removing a vertex with higher
betweenness centrality will result in a relative increase of the remaining paths’
lengths. This can lead to a decrease of the graph conservatism and thus also to a
decrease of its overall value w.r.t. the hypothesis virtues. Consequently, making
a hypothesis weaker more quickly can be seen as refuting it more efficiently. We
believe that this justifies the chosen ranking even though it means yet another
relaxation of the general requirements.⁵


5 An alternative option that fully conforms to the requirements would employ simple paths
instead of shortest ones and vertex degree instead of betweenness centrality. However, such a
solution can easily become intractable.

Example 5. The sets of shortest paths w.r.t. the $\delta$ distance for the particular
hypothesis graphs are

$$\pi_s(E,\delta) = \{(5,4), (5,4,6), (4,6)\},$$

$$\pi_s(F,\delta) = \{(2,1), (2,3), (1,3)\},$$

$$\pi_s(G,\delta) = \{(2,4), (2,4,5), (2,6), (4,5), (4,6), (5,4,6)\}.$$

The corresponding vertex betweenness centralities are then

$$c_B(4,E) = 1, c_B(5,E) = c_B(6,E) = 0.\overline{6},$$

$$c_B(1,F) = c_B(2,F) = c_B(3,F) = 0.\overline{6},$$

$$c_B(2,G) = c_B(5,G) = c_B(6,G) = 0.5, c_B(4,G) = 0.8\overline{3}.$$

The top-1 refutability measure for the hypothesis $E$ can be computed as follows.
The centrality-based ranking of the vertices places 4 on the top, therefore we
remove it. The result is a disconnected graph consisting of isolated vertices 5, 6
where no path exists anymore. The top-1 refutability measure of $E$ is thus

$$R(E,1) = \frac{3}{3+0} = 1.$$

Similarly, the top-1 refutability measures for the remaining two hypotheses (with
arbitrary removal vertex selection for $F$ due to uniform centrality ranking) are

$$R(F,1) = \frac{3}{3+1} = 0.75, R(G,1) = \frac{6}{6+1} = 0.857142.$$

The resulting refutability ranking of $E, F, G$ is

$$E \succ_R G \succ_R F.$$
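
The numbers in Example 5 can be checked with a short Python sketch. Several points below are assumptions rather than statements from the text: the edge list is inferred from the examples, the unweighted hop count stands in for the $\delta$ distance (the edge weights are not listed here), ties in the centrality ranking are broken by vertex identifier, and the refutability ratio is read off the worked numbers of the example. Note that the centrality used in the text counts paths that start or end in a vertex, so it is computed directly instead of calling a library routine for standard betweenness.

```python
from collections import deque
from itertools import combinations

# Universe edges inferred from Examples 4 and 5 (an assumption, not source data).
UNIVERSE_EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (2, 6), (4, 5), (4, 6)]


def shortest_paths(vertices, edges):
    """All hop-count shortest paths between unordered pairs of the given vertices."""
    vertices = set(vertices)
    adj = {v: set() for v in vertices}
    for u, v in edges:
        if u in vertices and v in vertices:  # induced sub-graph only
            adj[u].add(v)
            adj[v].add(u)
    paths = []
    for s, t in combinations(sorted(vertices), 2):
        # BFS from s, remembering every predecessor on a shortest route.
        dist, preds = {s: 0}, {s: []}
        queue = deque([s])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    preds[y] = [x]
                    queue.append(y)
                elif dist[y] == dist[x] + 1:
                    preds[y].append(x)
        if t not in dist:
            continue  # s and t are disconnected
        # Walk back from t to s along the recorded predecessors.
        stack = [(t,)]
        while stack:
            partial = stack.pop()
            if partial[0] == s:
                paths.append(partial)
            else:
                stack.extend((p,) + partial for p in preds[partial[0]])
    return paths


def centrality(vertex, paths):
    """Share of shortest paths that contain the vertex (endpoints included)."""
    return sum(vertex in p for p in paths) / len(paths)


def refutability(vertices, edges, k):
    """Ratio |paths in H| / (|paths in H| + |paths left after removing top-k|)."""
    paths = shortest_paths(vertices, edges)
    ranked = sorted(vertices, key=lambda v: (-centrality(v, paths), v))
    remaining = set(vertices) - set(ranked[:k])
    return len(paths) / (len(paths) + len(shortest_paths(remaining, edges)))


for name, vs in {"E": {4, 5, 6}, "F": {1, 2, 3}, "G": {2, 4, 5, 6}}.items():
    print(name, refutability(vs, UNIVERSE_EDGES, k=1))
```

The per-pair breadth-first search keeps the sketch short; it is only suitable for toy graphs such as the one in the example.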


3.3. Combining the Measures 

The specific measures defined in the previous section can be used to rank 
the hypothesis graphs independently of each other as shown in the examples. 
However, practical applications will very often imply the necessity to compare 
hypotheses along all the measures. Lacking any a priori information on which 
measures may be more relevant for a particular application, we propose the 
following way of ordering the hypothesis graphs. 

Let $\mathcal{H} = \{H_1, H_2, \dots, H_n\}$ be the set of hypothesis graphs we wish to
compare according to a set of measures $X = \{X_1, X_2, \dots, X_m\}$ of equal
importance. Then we can construct an edge-labeled directed ranking multigraph
$\mathcal{R} = (\mathcal{H}, \mathcal{E} \subseteq \mathcal{H} \times \mathcal{H}, \lambda : \mathcal{E} \to X)$.
The multigraph’s vertices are the hypotheses in $\mathcal{H}$. The edge set and the
labeling function are constructed from the specific measure rankings so that
$(H_i,H_j) \in \mathcal{E}, \lambda(H_i,H_j) = X_k$ if and only if there is
a measure $X_k$ such that $H_i \succ_{X_k} H_j$. Using the ranking multigraph $\mathcal{R}$, we can
define a combined ranking relation $\succ$ on the set $\mathcal{H} \times \mathcal{H}$ as

$$H_i \succ H_j \text{ if and only if } \frac{d_o(H_i,\mathcal{R})}{d_o(H_i,\mathcal{R}) + d_i(H_i,\mathcal{R})} > \frac{d_o(H_j,\mathcal{R})}{d_o(H_j,\mathcal{R}) + d_i(H_j,\mathcal{R})},$$

where $d_i(H_x,\mathcal{R}), d_o(H_x,\mathcal{R})$ are the in-degree and out-degree of the vertex $H_x$ in
the multigraph $\mathcal{R}$, respectively. In plain words, the combined ranking relation
$\succ$ orders the hypotheses based on the relative magnitude of their superiority
(out-degree) w.r.t. the specific ranking relations given by the measures.


Example 6. Figure 5 shows the ranking multigraph corresponding to the preceding
examples. A directed edge from vertex $X$ to $Y$ with a label $Z$ means that $X \succ_Z Y$.

Figure 5: Ranking multigraph for E, F, G

The in-degrees and out-degrees of $E, F, G$ in the ranking graph are

$$d_i(E) = 3, d_o(E) = 4, d_i(F) = 6, d_o(F) = 1, d_i(G) = 3, d_o(G) = 7,$$

therefore

$$G \succ E \succ F$$

since

$$\frac{7}{10} > \frac{4}{7} > \frac{1}{7}.$$
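
The combined ranking reduces to counting edges in the ranking multigraph, which the following Python sketch illustrates. It is a minimal illustration, not the authors' implementation: the function names are hypothetical, a hypothesis with no incident edges is assigned a share of zero (an assumption, as the text does not cover that case), and the two-measure demonstration uses only the generality and refutability values from Examples 4 and 5, so its degrees differ from those of the full multigraph in Figure 5.

```python
from fractions import Fraction
from itertools import permutations


def ranking_edges(scores):
    """scores: {measure: {hypothesis: value}} -> list of (better, worse, measure)."""
    edges = []
    for measure, values in scores.items():
        for a, b in permutations(values, 2):
            if values[a] > values[b]:
                edges.append((a, b, measure))
    return edges


def degrees(edges, hypotheses):
    """In-degree and out-degree of every hypothesis in the ranking multigraph."""
    in_deg = {h: 0 for h in hypotheses}
    out_deg = {h: 0 for h in hypotheses}
    for better, worse, _ in edges:
        out_deg[better] += 1
        in_deg[worse] += 1
    return in_deg, out_deg


def combined_order(in_deg, out_deg):
    """Sort hypotheses by d_o / (d_o + d_i), best first."""
    def share(h):
        total = in_deg[h] + out_deg[h]
        # Assumption: a hypothesis never compared with others gets share 0.
        return Fraction(out_deg[h], total) if total else Fraction(0)
    return sorted(in_deg, key=share, reverse=True)


# Degrees reported in Example 6.
print(combined_order({"E": 3, "F": 6, "G": 3}, {"E": 4, "F": 1, "G": 7}))

# Two measures only (values taken from Examples 4 and 5).
partial = {
    "generality": {"E": 6, "F": 6, "G": 8},
    "refutability": {"E": Fraction(1), "F": Fraction(3, 4), "G": Fraction(6, 7)},
}
in_deg, out_deg = degrees(ranking_edges(partial), ["E", "F", "G"])
print(combined_order(in_deg, out_deg))
```

Exact fractions avoid spurious ties or inversions that floating-point rounding could introduce when two shares are close.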


4. Experimental Validation 

In order to validate the proposed formalisation of hypothesis virtues in the
context of knowledge graphs, we chose to follow up on our work presented in [30]
where we addressed automated extraction of conceptual networks from biomedical
literature. The work deals with extraction of co-occurrence and similarity
relationships from abstracts available on PubMed (cf.
http://www.ncbi.nlm.nih.gov/pubmed) and consequent indexing, querying and
navigation of the networks in a knowledge discovery scenario.

As we have shown in [30], the automatically extracted networks can already 
provide useful insights even for experts in the field, however, they still contain 
some noise and irrelevant and/or obvious information. Tackling this challenge 
has been the main practical motivation for the research presented in this article. 
We believe we can use our approach to identify portions of the automatically 
extracted graphs that can not only provide general overview of the domain with 
less noise, but also isolate valid relationships that are surprising for experts. 
This can ultimately lead to more efficient machine-aided discovery applications. 

In our validation experiments, we utilise the scenarios, data sets and
evaluation methodologies elaborated within the field of literature-based discovery
which we introduce in Section 4.1 below. Section 4.2 is the methodological
core of this part. It presents an evolutionary approach to the refinement of
automatically extracted knowledge graphs using the hypothesis virtue measures.
Section 4.3 describes the data sets and methods we use for the experimental
evaluation. Finally, Section 4.4 discusses the results of the experiments.

Note that we have implemented our approach and the experiments reported
in this section using a Python prototype available under the GPL free software
license. The corresponding code, experimental data and results are available
at http://skimmr.org/hyperkraph/.⁶ Detailed README documentation on
the implementation and data is provided as a part of the respective archives
hosted at the referenced URL.


4.1. Literature-Based Discovery

The field of literature-based discovery is widely considered to stem from the
work [44]. Based on [44] and a follow-up article [45], the work [43] introduced
the notion of Swanson linking - connecting two pieces of knowledge in isolated 
documents A and B using concepts from intermediate documents (C) that are 
directly or indirectly related to A and B. Surveys of recent works addressing 
this problem are provided in [3 M HQ- 

The application of our framework to refining knowledge graphs automati¬ 
cally extracted from literature is closely related to literature-based discovery. 
Our goal is to generate a set of graphs that reflect relationships between terms 
in literature and are optimised w.r.t. hypothesis virtues. Such a structure can 
very straightforwardly facilitate the process of finding “interesting” links
between isolated concepts via intermediates, which is the key problem of
literature-based discovery. Therefore, we can use the standard approaches and
man-made “gold standard” discoveries from that field to experimentally validate
our approach in an established application scenario.


6 HYPERKRAPH is a general name we use for the ongoing implementation of prototypes
based on the presented research. It stands for HYPothEsis viRtues in Knowledge gRAPHs.

4.2. Evolutionary Refinement of Automatically Extracted Knowledge Graphs

The basic assumption we use for validating our framework is that applying 
the hypothesis virtue measures to refining graphs extracted from literature will 
facilitate literature-based discovery tasks better than the unrefined graphs. To 
verify this, we have to tackle the graph refinement first. The key question is: 
Given a knowledge graph based on statements automatically extracted from text, 
how can we refine it so that only the parts of the graph that have comparatively 
high hypothesis virtue measures remain? 

This is essentially an optimisation problem in which we know how to tell 
whether a solution X is better than Y, but we do not know much about what 
the actual solutions are and how the main knowledge graph is (or should be) 
composed of them. Such problems can quite efficiently be tackled by evolution¬ 
ary computing [5]. In the rest of this section, we describe a specific algorithm 
for evolutionary refinement of knowledge graphs. 

4.2.1. Extracting a Universe Graph from Texts

Figure 6 presents the high-level overview of the graph extraction and
refinement process. First we use our SKIMMR tool [30] to extract basic co-occurrence
statements from the input texts. The statements are in the shape of tuples
$(t_1, t_2, w_d, T)$, where $t_1, t_2$ are two terms that co-occur in an input text $T$ and
$w_d$ is the weight of the co-occurrence based on the sentence distances of the
terms within $T$.

In the next step (M2 in Figure 6), we:

1. Use the basic statements to compute corpus-wide co-occurrence weights 
using normalised point-wise mutual information. 

2. Encode the terms in the statements using integer identifiers (to optimise 
the memory usage in the consequent steps). 

3. Build a fulltext index upon the lexical vertex labels for accessing them 
during the evaluation (this mitigates the impact of spelling alternatives 
and other irregularities in the automatically extracted names). 

4. Initialise an undirected edge-labeled universe graph U with edges con¬ 
structed from the corpus-wide statements. The graph can possibly be 
limited to edges with normalised point-wise mutual information weights 
above a pre-defined threshold. 

5. Construct a context vector space for the U vertices based on their neigh¬ 
bors and corresponding edge weights. 

6. Use the vector space to compute the Euclidean distances between the
vertices.
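
Steps 1, 2 and 4 of the list above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SKIMMR implementation: the input is taken to be a dictionary of already aggregated co-occurrence weights per term pair, the marginal probability of a term is estimated from the total weight of the pairs it takes part in, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def npmi_weights(cooccurrences):
    """cooccurrences: {(term_a, term_b): aggregated weight} -> {(a, b): NPMI}."""
    total = sum(cooccurrences.values())
    marginal = defaultdict(float)
    for (a, b), weight in cooccurrences.items():
        marginal[a] += weight
        marginal[b] += weight
    weights = {}
    for (a, b), weight in cooccurrences.items():
        p_ab = weight / total
        if p_ab >= 1.0:
            continue  # NPMI is undefined when one pair carries all the mass
        pmi = math.log(p_ab / ((marginal[a] / total) * (marginal[b] / total)))
        weights[(a, b)] = pmi / -math.log(p_ab)  # normalise into [-1, 1]
    return weights


def build_universe(cooccurrences, threshold=0.0):
    """Integer-encode the terms and keep only edges with NPMI above the threshold."""
    ids = {}
    adjacency = defaultdict(dict)
    for (a, b), score in npmi_weights(cooccurrences).items():
        if score <= threshold:
            continue
        u = ids.setdefault(a, len(ids))
        v = ids.setdefault(b, len(ids))
        adjacency[u][v] = score  # undirected graph: store both directions
        adjacency[v][u] = score
    return ids, dict(adjacency)


demo = {("fish oil", "platelet"): 2.0, ("raynaud", "viscosity"): 2.0}
ids, universe = build_universe(demo, threshold=0.5)
print(ids)
print(universe)
```

Storing integer identifiers in the adjacency structure mirrors the memory optimisation mentioned in step 2; the term strings are kept only in the `ids` lookup table.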

Steps M3 and M4 in Figure 6 perform the K-means clustering of the universe
graph $U$ in order to provide a vertex labeling $\gamma$ that associates each vertex with
cluster(s) it belongs to (see Section 4.3.3 for details on the K-means settings in
the particular experiments we conducted). At this moment, everything is ready
for optimising $U$ according to the hypothesis virtue measures of its sub-graphs.

Figure 6: High-level workflow of the graph construction and refinement

[Diagram labels: D1: universe graph; M1: initialisation; D2: population;
M2: mutation, crossover and validation; D3: expanded population;
IF NOT TERMINATING; M3: ranking and trimming; D4: trimmed population]

Figure 7: Detailed workflow of the evolutionary refinement



4.2.2. Evolutionary Graph Refinement

The optimisation step in Figure 6 is performed using a genetic algorithm [9].
Its detailed workflow is presented in Figure 7. The genetic algorithm has the
following configurable parameters: 1. mutation and mating probabilities $p_m, p_c$
defining how likely it is for an individual in a population to mutate and mate
(i.e., engage in a crossover with another individual); 2. number $k_m$ defining how
many times an individual can attempt to mate in a generation; 3. maximum
number of generations $N_g$; 4. rate $p_p$ of the standard deviation of the population
size - it sets the size of the population $P_i$ to $|P_i| = gauss(|P_{i-1}|, p_p |P_{i-1}|)$ where
$gauss(\mu, \sigma)$ returns a random number from the normal distribution with mean
$\mu$ and standard deviation $\sigma$, truncated to integer; 5. the mean and standard
deviation $\mu_i, \sigma_i$ for determining the sizes of the individuals in the initial
population. For specific values of the parameters and discussion of their influence on
the evolution process in our experiments, see Section 4.3.3.

The population is initialised (step M1 in Figure 7) by a repetitive random
selection of possibly overlapping stars of size $gauss(\mu_i, \sigma_i)$ from the graph $U$.
Stars consist of one “hub” vertex and a set of vertices “fanning out” of the hub
via immediate edges. They are a specific type of sub-graphs that can be used as
atomic graph construction blocks [25] and thus they are fitting for the purpose
of population initialisation.
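
The star-based initialisation can be sketched in a few lines of Python. The sketch assumes the universe graph is available as an adjacency dictionary, interprets the star size as the number of leaves, and forces at least one leaf per star; these choices and the function names are illustrative assumptions, not details taken from the prototype.

```python
import random


def random_star(adjacency, mu_i, sigma_i, rng):
    """Pick a random hub and up to gauss(mu_i, sigma_i) of its neighbours."""
    hub = rng.choice(sorted(adjacency))
    size = max(1, int(rng.gauss(mu_i, sigma_i)))  # at least one leaf (assumption)
    neighbours = sorted(adjacency[hub])
    leaves = rng.sample(neighbours, min(size, len(neighbours)))
    return hub, frozenset((hub, leaf) for leaf in leaves)


def initial_population(adjacency, n_individuals, mu_i, sigma_i, seed=0):
    """Repeated random selection of possibly overlapping stars."""
    rng = random.Random(seed)
    return [random_star(adjacency, mu_i, sigma_i, rng)[1]
            for _ in range(n_individuals)]


# Toy universe graph (hypothetical data).
adjacency = {1: {2, 3}, 2: {1, 3, 4, 6}, 3: {1, 2}, 4: {2, 5, 6}, 5: {4}, 6: {2, 4}}
population = initial_population(adjacency, n_individuals=5, mu_i=100, sigma_i=80)
print(population)
```

Because the requested size is capped by the hub degree, the actual star sizes depend mostly on the structure of the graph, not on the mean and standard deviation parameters.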

Step M2 in Figure 7 consists of applying the evolutionary operators on the
population and consequent validation of the newly added individuals which 
discards disconnected ones. The mutation deletes or adds an edge from/to the 
individual graph with equal probabilities. The crossover combines two parents 
by randomly selecting half of the edges from each parent and combining them in 
a new individual. All existing edge labels are copied in the process of creating 
new individuals. 

Step M3 in Figure 7 is essential for the optimisation - it computes the
hypothesis virtue measures of each individual in the expanded population and
then ranks the population according to the combined ranking $\succ$ introduced in
Section 3.3. The population is then trimmed to a random size based on the
previous population size (computed using the $p_p$ parameter).

Steps M2 and M3 are repeated until a termination condition is met. This
can either be reaching a pre-defined number of generations $N_g$, or achieving
some sort of population convergence.
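
The evolutionary operators described above can be sketched in Python as follows. This is an illustrative sketch under assumptions, not the prototype's code: individuals are represented as frozensets of edges, an added edge is drawn from the universe graph, the population size is kept at one or more, and all function names are hypothetical. The ranking step is omitted because it depends on the full set of virtue measures.

```python
import random


def is_connected(edges):
    """Validation step: True if the edges form a single connected component."""
    if not edges:
        return False
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x] - seen:
            seen.add(y)
            stack.append(y)
    return len(seen) == len(adj)


def mutate(individual, universe_edges, rng):
    """Delete or add one edge with equal probability."""
    edges = set(individual)
    if rng.random() < 0.5 and edges:
        edges.remove(rng.choice(sorted(edges)))
    else:
        candidates = sorted(set(universe_edges) - edges)  # assumption: add from U
        if candidates:
            edges.add(rng.choice(candidates))
    return frozenset(edges)


def crossover(parent_a, parent_b, rng):
    """Combine a random half of the edges of each parent."""
    half_a = rng.sample(sorted(parent_a), len(parent_a) // 2)
    half_b = rng.sample(sorted(parent_b), len(parent_b) // 2)
    return frozenset(half_a) | frozenset(half_b)


def next_size(previous_size, p_p, rng):
    """Population size drawn from gauss(|P|, p_p * |P|), truncated to integer."""
    return max(1, int(rng.gauss(previous_size, p_p * previous_size)))


rng = random.Random(42)
universe_edges = [(1, 2), (1, 3), (2, 3), (2, 4), (2, 6), (4, 5), (4, 6)]
a = frozenset({(1, 2), (1, 3), (2, 3), (2, 4)})
b = frozenset({(4, 5), (4, 6)})
child = crossover(a, b, rng)
mutant = mutate(b, universe_edges, rng)
print(child, is_connected(child))
print(mutant, is_connected(mutant))
```

Offspring that fail the `is_connected` check would simply be discarded before the ranking step.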

4.3. Description of the Experiments

For the evaluation of our approach we chose two standard scenarios in
literature-based discovery based on the works [33, 35]. Details on the
corresponding data sets and experiments we performed using them are described in
the following sections.

4.3.1. Data Acquisition

We used two data sets in the experimental evaluation, both of which address
discovery of connections between previously isolated concepts (and
corresponding bodies of literature). One data set is based on [33], which explores the
relationship between fish oil and Raynaud’s syndrome. The other data set is based
on a similar study of previously neglected connections between migraine and
magnesium [35]. We refer to these two data sets and corresponding experiments as
Tr and Tm, respectively.

The initial corpora of texts for the Tr, Tm experiments were obtained from
PubMed via queries compiled according to the specifications given in [33, 35].
Each of these works defines source and target terms $t_s, t_t$ together with a set $I_c$
of intermediate terms that connect them. A query for the PubMed abstracts
corresponding to specific $t_s, t_t, I_c$ is compiled as a disjunction of atomic
conjunctions

$$\bigvee_{t \in \{t_s, t_t\}, t_c \in I_c} (t \wedge t_c).$$

The particular queries we used for obtaining the Tr, Tm corpora were

("raynaud" AND "blood") OR ("raynaud" AND "viscosity") OR
("raynaud" AND "platelet") OR ("raynaud" AND "vascular") OR
("raynaud" AND "reactivity") OR ("fish oil" AND "blood") OR
("fish oil" AND "viscosity") OR ("fish oil" AND "platelet") OR
("fish oil" AND "vascular") OR ("fish oil" AND "reactivity")


and

("migraine" AND "vasospasm") OR ("migraine" AND "spreading depression") OR
("migraine" AND "vascular reactivity") OR ("migraine" AND "depolarization") OR
("migraine" AND "epilepsy") OR ("migraine" AND "inflammation") OR
("migraine" AND "prostaglandins") OR ("migraine" AND "platelet aggregation") OR
("migraine" AND "serotonin") OR ("migraine" AND "brain anoxia") OR
("migraine" AND "calcium channel blockers") OR ("magnesium" AND "vasospasm") OR
("magnesium" AND "spreading depression") OR
("magnesium" AND "vascular reactivity") OR ("magnesium" AND "depolarization") OR
("magnesium" AND "epilepsy") OR ("magnesium" AND "inflammation") OR
("magnesium" AND "prostaglandins") OR ("magnesium" AND "platelet aggregation") OR
("magnesium" AND "serotonin") OR ("magnesium" AND "brain anoxia") OR
("magnesium" AND "calcium channel blockers")

respectively. Note that while the Tm query exactly corresponds to the terms
given in [35], the Tr query is relaxed to sub-terms as the exact query only yields
very few abstracts. The PubMed search was limited to articles indexed until
November, 1985 and August, 1987 for Tr, Tm, respectively, so that we can
compare ourselves to the findings of the original works which have served as a
de facto gold standard in the literature-based discovery field [4].
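
The query compilation is a plain Cartesian product of the endpoint terms and the intermediate terms. The following Python sketch illustrates it; the function name is hypothetical and the term lists are copied from the queries above.

```python
def compile_query(endpoints, intermediates):
    """Disjunction of ("endpoint" AND "intermediate") conjunctions."""
    clauses = [f'("{t}" AND "{c}")' for t in endpoints for c in intermediates]
    return " OR ".join(clauses)


raynaud_query = compile_query(
    ["raynaud", "fish oil"],
    ["blood", "viscosity", "platelet", "vascular", "reactivity"],
)

migraine_query = compile_query(
    ["migraine", "magnesium"],
    ["vasospasm", "spreading depression", "vascular reactivity",
     "depolarization", "epilepsy", "inflammation", "prostaglandins",
     "platelet aggregation", "serotonin", "brain anoxia",
     "calcium channel blockers"],
)

print(raynaud_query)
print(migraine_query)
```

The date restriction mentioned in the text is not part of this sketch; it would be applied as an additional search limit.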

The characteristics of the Tr, Tm corpora are summarised in Table 3. Number
of tokens is a sum of the word-length of the documents in the corpus and
number of base statements is the number of the base co-occurrence statements the
SKIMMR tool extracted from the corpus.

Corpus   # of abstracts   # of tokens   # of base statements
Tr       1,406            90,427        407,154
Tm       3,611            319,810       1,534,685

Table 3: Basic statistics of the corpora

4.3.2. Graph Extraction

To generate knowledge graphs from the text corpora, we use the approach
introduced in Section 4.2. We construct the experimental graphs using only
co-occurrence statements with above-average positive normalised point-wise
mutual information scores. This filters out statements with comparatively low
co-occurrence weight. We use the general SKIMMR version that extracts
entities based on shallow parsing rather than domain-specific models (see
https://github.com/vitnov/SKIMMR for details). This is to demonstrate the
generality of our work: if we show that our approach can deliver good results even in
quite a specific domain using basic and universally applicable initial text mining,
it indicates that it is likely to perform similarly well in other domains.

The characteristics of the extracted graphs are provided in Tables 4 and 5.
The basic characteristics $|V_G|, |E_G|, dn_G, |C_G|, |c_G^{max}|, |c_G^{avg}|, |c_G^{med}|$ in Table 4 are
the number of vertices, number of edges, graph density, number of connected
components, maximum, average and median component size in vertices,
respectively.

Graph   $|V_G|$   $|E_G|$   $dn_G$     $|C_G|$   $|c_G^{max}|$   $|c_G^{avg}|$   $|c_G^{med}|$
Tr      16,714    181,140   1.297e-3   80        16,497          208.925         2
Tm      52,681    635,705   4.581e-4   110       52,373          478.918         2

Table 4: Basic characteristics of the experimental graphs

The component-wise characteristics in Table 5 are computed as a weighted
arithmetic mean across all the components where the weight is the component
size in vertices. The characteristics $rd_G, dm_G$ are the graph radius and
diameter (minimum and maximum eccentricity, respectively, where eccentricity of
a vertex is its maximum distance to other vertices). The $tr_G$ characteristic
is transitivity - the fraction of all possible triangles reflecting the tendency of
vertices in the graph to cluster together [35]. The characteristics $asp_G, asp_G^{\delta}$
are average shortest path lengths in terms of edges and the distance labeling,
respectively. An additional characteristic of the graphs is the degree distribution
depicted in Figure 8 (the plot is log-scaled in both x- and y-axis).

Graph   $rd_G$   $dm_G$   $tr_G$   $asp_G$   $asp_G^{\delta}$
Tr      5.935    8.897    0.397    4.045     6.956
Tm      5.971    8.954    0.259    3.964     7.702

Table 5: Component-wise characteristics of the experimental graphs

The extracted graphs both have one large connected component comprising
most of the vertices, complemented by other trivial components mostly
consisting of one edge. The largest components exhibit the so-called “small-world”
property gS] - despite being quite large and having small density, they have
relatively small diameters and average shortest paths. This observation is
supported by two additional facts. The graphs have relatively high transitivity,
i.e., high tendency of vertices to cluster together, which is typical for complex
small-world networks [39]. Also, the vertex degree distribution approximately
follows the power law as shown in Figure 8, which is characteristic for scale-free
networks m • This means that the extracted graphs have a relatively densely
connected structure with many claims involving frequently repeated concepts,
which is largely caused by highly (co)occurring terms. This is perhaps not ideal
for making discoveries about as many previously disconnected phenomena as
possible, and we later show how our approach can remedy this problem.

Figure 8: Degree distributions for the experimental graphs (degree rank plots)

4.3.3. Settings of the Clustering and Evolutionary Refinement Algorithms

For the clustering, we use the K-means module of the scikit-learn
package [32]. As the algorithm’s scalability to large numbers of samples and
features is limited by available memory, we partition the set of context vectors
corresponding to the universe graph vertices to buckets of size 2,000 and then
run the K-means algorithm on them with the parameter K set to 40. The
partitioning is done by incremental random selection of 50 seed vectors from
the unpartitioned set, computing their centroid and then filling the partition
with the seeds plus up to 1,950 unpartitioned vectors closest to the centroid.
We have experimented with different settings of every parameter, however, we
found out that the resulting distributions of vectors into clusters are practically
invariant to the settings, with mean and median cluster sizes converging to the
same values no matter what the settings were.
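
The bucket partitioning that precedes the clustering can be sketched in pure Python as shown below. The sketch is an assumption-laden illustration, not the prototype's code: vectors are plain lists of floats, distances are Euclidean, and the function name and the fixed random seed are hypothetical. The K-means step itself (run separately on each bucket) is left out.

```python
import math
import random


def partition_vectors(vectors, bucket_size=2000, n_seeds=50, seed=0):
    """Split vector indices into buckets built around centroids of random seeds."""
    rng = random.Random(seed)
    remaining = set(range(len(vectors)))
    buckets = []
    while remaining:
        # Random seeds taken from the still unpartitioned vectors.
        seeds = rng.sample(sorted(remaining), min(n_seeds, len(remaining)))
        remaining.difference_update(seeds)
        dims = len(vectors[seeds[0]])
        centroid = [sum(vectors[i][d] for i in seeds) / len(seeds)
                    for d in range(dims)]
        # Fill the bucket with the unpartitioned vectors closest to the centroid.
        nearest = sorted(remaining, key=lambda i: math.dist(vectors[i], centroid))
        fill = nearest[:max(0, bucket_size - len(seeds))]
        remaining.difference_update(fill)
        buckets.append(seeds + fill)
    return buckets


toy_vectors = [[float(i), 0.0] for i in range(10)]
toy_buckets = partition_vectors(toy_vectors, bucket_size=4, n_seeds=2)
print(toy_buckets)
```

Each bucket returned by the sketch holds vector indices, so the corresponding rows can be passed to a clustering routine one bucket at a time.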

The parameters of the evolutionary refinement were

$$p_m = 0.05, p_c = 0.75, k_m = 5, N_g = 50, p_p = 0.05, \mu_i = 100, \sigma_i = 80.$$

The initial individual size parameters are only reflected in rare extreme cases as
the size of the random stars is much more dependent on the data set structure
in practice. For the other parameters, we applied values typically used by model
approaches presented in the evolutionary computing literature [2]. The number
of generations has been set well above a threshold after which the performance
of the corresponding populations starts to oscillate around similar evaluation
scores (see Section 4.4.1 for details).

The evolutionary refinement with these parameters took 56m and 6h36m
for the Tr, Tm experiments, respectively, using a 2010 make laptop with a
4-core CPU, 8GB RAM and Ubuntu Linux 14.04 OS. The virtue measures (the
most demanding part) were computed using six parallel processes. The number
of processes can be easily adjusted to the computing power available, which
facilitates vertical scalability of the refinement. Horizontal scalability is planned
for future versions of the prototype and consists of using a distributed processing
library instead of the native Python multiprocessing module.

4.3.4. Evaluation Methodology 

We use several evaluation methods. Some of them are based on a recent
work which defines evidence-based and literature frequency-based evaluation
measures within the de facto standard literature-based discovery scenarios
elaborated in [2J EH] • The additional benefit of using [3] as a primary reference 
for the evaluation is that the authors compared results of several representa¬ 
tive approaches to literature-based discovery. Thus we can interpret our results 
within a broader context of the whole field. In addition to the measures defined 
in [4], we perform a qualitative evaluation of the actual contents of the results
and compare ourselves to the related state of the art where applicable.

The evidence-based evaluation measures the capability of an approach to
re-discover the intermediate concepts linking the source and target in the corpus
as per discoveries made by human experts. It also measures the importance the
approach associates with the re-discovery. For an intermediate $t_c$, the absolute
evidence-based evaluation measure directly corresponding to [4] is defined as

$$evd(t_c) = \min_{G \in \mathcal{G}_c}(rnk(G)),$$

where $\mathcal{G}_c = \{G \mid t_s, t_t, t_c \in V_G \wedge \exists p \in \Pi(G). p = (t_s, \dots, t_c, \dots, t_t)\}$ is a set of
solution graphs that contain the source and target terms $t_s, t_t$ linked by the
intermediate.⁷ The function $rnk : \mathcal{G} \to \mathbb{N}$ is a ranking of all solution graphs
$\mathcal{G} = \{G \mid t_s, t_t \in V_G\}$ from the most to the least relevant where the relevance is
determined by the specific approach being evaluated.

We construct the sets of ranked solution graphs from the set of individuals in
a selected refined generation by: 1. Creating a union graph from all population
individuals. 2. Generating a set of paths between the source and target term
vertices that also contain an intermediate vertex. 3. Ranking the paths using
their hypothesis virtue measures, i.e., the $\succ$ relation, with the population union
graph as a universe. Step 2 can either compute all simple paths or all
shortest paths. In our experiments, we use the latter option due to tractability
issues. The conception of paths as solution graphs represents another design
choice consistent with the previous definitions - a path linking certain concepts
is the simplest way of claiming (and potentially also explaining) something about
them.
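
Steps 1 and 2 of this procedure can be sketched in Python as follows. The sketch makes several assumptions: individuals are sets of undirected edges over term labels, the hop count stands in for the distance labeling, and the function names are hypothetical. Step 3 (ranking by the virtue measures) is omitted.

```python
from collections import deque


def union_adjacency(individuals):
    """Step 1: union graph of all population individuals."""
    adj = {}
    for edges in individuals:
        for u, v in edges:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj


def all_shortest_paths(adj, source, target):
    """All hop-count shortest paths from source to target."""
    if source not in adj or target not in adj:
        return []
    dist, preds = {source: 0}, {source: []}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                preds[y] = [x]
                queue.append(y)
            elif dist[y] == dist[x] + 1:
                preds[y].append(x)
    if target not in dist:
        return []
    paths, stack = [], [(target,)]
    while stack:
        partial = stack.pop()
        if partial[0] == source:
            paths.append(partial)
        else:
            stack.extend((p,) + partial for p in preds[partial[0]])
    return paths


def solution_paths(individuals, source, target, intermediates):
    """Step 2: shortest source-target paths passing through an intermediate."""
    adj = union_adjacency(individuals)
    return [p for p in all_shortest_paths(adj, source, target)
            if any(term in intermediates for term in p[1:-1])]


population = [
    {("fish oil", "platelet aggregation")},
    {("platelet aggregation", "raynaud"), ("fish oil", "x"), ("x", "raynaud")},
]
found = solution_paths(population, "fish oil", "raynaud", {"platelet aggregation"})
print(found)
```

In the toy population, two shortest paths connect the source and target, and only the one passing through the listed intermediate is kept as a solution.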


7 Note that for mapping terms to vertices in the resulting knowledge graphs, we use the
fulltext index computed upon the lexical expressions corresponding to the graph vertices.
This is done when generating the universe graph, see Section 4.2.1 for details. To get all
term manifestations in our automatically extracted knowledge graphs, we look up the term of
interest in the index and then manually prune the results to get all alternatives that refer to
the corresponding concept.

In addition to the absolute $evd$ score, we compute the overall relative
importance of an intermediate term $t_c$. This measure is defined as a mean relative
inverse rank of the graphs that contain $t_c$ among all solutions, i.e.,

$$evd_r(t_c) = \frac{1}{|\mathcal{G}_c|} \sum_{G \in \mathcal{G}_c} \frac{|\mathcal{G}| - rnk(G) + 1}{|\mathcal{G}|}.$$

It effectively measures the average relative relevance of the hypotheses linking
the source and target terms via $t_c$ - the more often the link is discovered in
high-ranking graphs, the higher the measure.
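
Both evidence-based measures are simple functions of a ranked list of solutions, as the following Python sketch shows. It assumes that solutions are paths given as tuples ordered from the most to the least relevant, and that an intermediate missing from all solutions has no absolute rank and a relative score of zero; the latter convention and the function names are assumptions made for illustration.

```python
from fractions import Fraction


def ranks_with(ranked_solutions, intermediate):
    """Ranks (1 = best) of the solutions whose interior contains the intermediate."""
    return [rank for rank, path in enumerate(ranked_solutions, start=1)
            if intermediate in path[1:-1]]


def evd(ranked_solutions, intermediate):
    """Best (lowest) rank of a solution linking source and target via the term."""
    ranks = ranks_with(ranked_solutions, intermediate)
    return min(ranks) if ranks else None


def evd_r(ranked_solutions, intermediate):
    """Mean relative inverse rank of the solutions containing the term."""
    n = len(ranked_solutions)
    ranks = ranks_with(ranked_solutions, intermediate)
    if not ranks:
        return Fraction(0)  # assumption: absent intermediates score zero
    return sum(Fraction(n - rank + 1, n) for rank in ranks) / len(ranks)


ranked = [("s", "a", "t"), ("s", "b", "t"), ("s", "a", "c", "t")]
for term in ("a", "b", "c", "z"):
    print(term, evd(ranked, term), evd_r(ranked, term))
```

For the toy ranking, the term "a" appears in the best and the worst solution, so its relative score is the mean of 1 and 1/3.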

The second evaluation method proposed in [3] measures the frequency of the
discovered claims in the scientific literature. Similarly to our definition, a path
in the result graph is considered a claim in [4]. The literature frequency can be
used to define a measure of solution rarity as

$$rar(\mathcal{G}) = \frac{1}{|\pi_s(\mathcal{G}_I)|} \sum_{p \in \pi_s(\mathcal{G}_I)} f_{pm}(Q_A(p)),$$

where $\mathcal{G}_I = \{G \mid G \in \mathcal{G} \wedge \exists t_c \in I_c, p \in \Pi(G). p = (t_s, \dots, t_c, \dots, t_t)\}$ is a set of
solutions that contain an intermediate term, $\pi_s(\mathcal{G}_I) = \bigcup_{G \in \mathcal{G}_I} \pi_s(G)$ is the union
of shortest paths taken across $\mathcal{G}_I$, and $f_{pm}$ is the number of results returned by
PubMed for an association query $Q_A(p)$. The query for a path $(p_1, p_2, \dots, p_{|p|})$
corresponds to the conjunction $\bigwedge_{t \in p} t$ of all terms in the path (with a
publication time window limited according to the corresponding experimental corpus).
For instance, the path (fish oil, platelet aggregation, Raynaud’s syndrome) corre¬ 
sponds to the PubMed query "fish oil" AND "platelet aggregation" AND 
"Raynaud’s syndrome" AND ("0001/01/01" [PDAT] : "1985/11/30"[PDAT]) 
in the Tr experiment. Finally, the rarity measure can be straightforwardly used 
for defining an interestingness measure @j as a normalised inverse of the rarity 


$$int(\mathcal{G}) = \frac{1}{1 + rar(\mathcal{G})}.$$
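
The rarity and interestingness measures can be sketched in Python as follows. The sketch does not contact PubMed: the frequency lookup is passed in as a callable, and the demonstration uses a hypothetical table of counts. The function names and the exact spacing of the generated query string are likewise illustrative assumptions.

```python
from fractions import Fraction


def association_query(path, until):
    """Conjunction of all path terms with a publication date window."""
    terms = " AND ".join(f'"{term}"' for term in path)
    return f'{terms} AND ("0001/01/01"[PDAT] : "{until}"[PDAT])'


def rarity(paths, frequency):
    """Mean literature frequency over the union of the given paths."""
    unique_paths = list(dict.fromkeys(paths))  # union: drop duplicate paths
    return Fraction(sum(frequency(p) for p in unique_paths), len(unique_paths))


def interestingness(paths, frequency):
    """Normalised inverse of the rarity measure."""
    return 1 / (1 + rarity(paths, frequency))


# Hypothetical frequency table standing in for PubMed result counts.
p1 = ("fish oil", "platelet aggregation", "Raynaud's syndrome")
p2 = ("fish oil", "blood viscosity", "Raynaud's syndrome")
counts = {p1: 0, p2: 4}

print(association_query(p1, "1985/11/30"))
print(rarity([p1, p2, p1], counts.get))
print(interestingness([p1, p2, p1], counts.get))
```

Passing the lookup as a callable keeps the measures testable offline; a real evaluation would replace `counts.get` with a function that submits the generated query and returns the number of hits.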


The qualitative evaluation of the results is based on the sets of topics covered
by the particular solutions. A topic is informally defined by potentially relevant
terms that lie on a path between source and target concepts in a solution.
Potentially relevant terms are those that refer to non-trivial concepts that may
elucidate the meaning of the particular path. Using the notion of topics, we
define the measures of topical density, relative topical relevance and relative
topical novelty, respectively, as


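$$top_d(\mathcal{G}) = \frac{|T_{unq}(\mathcal{G}_I)|}{|T_{all}(\mathcal{G}_I)|}, \quad top_r(\mathcal{G}) = \frac{|T_{rel}(\mathcal{G}_I)|}{|T_{unq}(\mathcal{G}_I)|}, \quad top_n(\mathcal{G}) = \frac{|T_{nvl}(\mathcal{G}_I)|}{|T_{rel}(\mathcal{G}_I)|}$$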


for a set $\mathcal{G}$ of all solution graphs. The sets $T_{unq}(\mathcal{G}_I), T_{all}(\mathcal{G}_I), T_{rel}(\mathcal{G}_I), T_{nvl}(\mathcal{G}_I)$
are sets of unique, all, relevant and novel topics covered by the solution graphs
in $\mathcal{G}$ that contain an intermediate term.

The relevance of topics is determined by a review of the available scientific
literature. This tells us whether or not a given set of terms can provide a
meaningful and non-trivial explanation of the connection between the source and
target terms. More specifically, a topic is considered relevant if and only if the
following conditions are met simultaneously: 1. The terms in the topic refer to
features of a biomedically relevant relationship that can be traced in literature.
2. The relationship is associated with the corresponding target, source and
intermediate terms. 3. The relationship is not trivial - it has to be supported
by genuine discoveries presented in literature, not by statements of the obvious
that merely occur in articles.

A novel topic is one that is relevant and not covered by any single published
work in its entirety. This can be determined using a publication search engine
such as PubMed, where we can check the number of results of a conjunctive
query involving all terms in the corresponding claim path. If the number of
results is zero, then the topic is novel.

We compute the top_d, top_r, top_n scores for the initially extracted and 
refined graphs in both experiments, focusing on solutions involving 
corresponding source, target and intermediate terms. Whenever applicable, we 
compare the relevant topics we generated with the topics (re)discovered by 
related approaches. 

4.4. Results and Discussion 

We split this section into three parts - first we explain the process of selection 
of the refined graphs to be evaluated, then we analyse properties of the selected 
graphs, and finally we discuss the results of the evaluation. 

4.4.1. Selection of the Refined Graphs 

Before analysing the actual results of the evolutionary refinement, we have 
to select the generation we will focus on. A natural criterion for that is the 
performance of generations in terms of the evaluation measures. The relative 
ranking of intermediate concepts (i.e., the evd_r measure) is best suited for this 
task as it tells us to what extent the generations tend to “consider” the 
intermediate connections important. Figure 9 shows how the mean evd_r values for 
all intermediates evolve throughout the generations for each experiment. The blue 
and green lines represent the T_R and T_M experiments, respectively. The full 
lines correspond to mean values taken across all intermediate terms (also marked 
by the “star” character in the plot legend). The dashed lines are for mean values 
omitting intermediates that are not present in the given generation (marked by 
the “plus” character in the legend). 

Figure 9: Mean relative intermediate ranks through generations (x-axis: generation) 

For the T_R experiment, generation 40 clearly performs best as it contains 
solution graphs for each intermediate term and their mean relative ranking is 
very high (within the top 20% of solutions). For the T_M experiment, the 
situation is less clear. The best generation in terms of the mean across all 
intermediate terms is number 39; however, if one takes only the present 
intermediates into account, generations 35-38 all perform better. Yet we decided 
to further analyse generation 39 as it covers three intermediates, while 
generations 35-38 only cover two. From here on, we refer to the selected 
generations as T_R^40 and T_M^39, respectively. 
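
The two ways of averaging the relative ranks can lead to different choices of 
generation, which the following minimal sketch illustrates. It assumes that an 
intermediate absent from a generation contributes zero to the mean taken across 
all intermediates; this treatment, the function names and the numbers are 
assumptions made for illustration only. 

```python
def mean_rank_all(ranks):
    """Mean evd_r over all intermediates; absent ones (None) count as 0."""
    return sum(r or 0.0 for r in ranks.values()) / len(ranks)


def mean_rank_present(ranks):
    """Mean evd_r over the intermediates present in the generation only."""
    present = [r for r in ranks.values() if r is not None]
    return sum(present) / len(present) if present else 0.0


def best_generation(history, score=mean_rank_all):
    """Return the generation number with the highest score."""
    return max(history, key=lambda generation: score(history[generation]))


# Toy history: generation -> {intermediate: evd_r or None if absent}.
history = {
    38: {"a": 0.9, "b": 0.8, "c": None},   # two intermediates, ranked high
    39: {"a": 0.7, "b": 0.6, "c": 0.5},    # three intermediates, ranked lower
}

print(best_generation(history))                      # 39
print(best_generation(history, mean_rank_present))   # 38
```

The toy values mirror the situation described above: one generation wins on the 
mean across all intermediates because it covers more of them, while another wins 
when only the present intermediates are averaged. 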

Further support for selecting the generation to be analysed can be drawn 
from the numbers of claims containing the source and target terms, and the 
numbers of such claims that also contain an intermediate term. The evolution 
of these values is depicted in Figure 10. Note that the figure’s y-axis is 
log-scaled due to different orders of magnitude of the displayed values. 
Similarly to the previous figure, the blue and green lines represent the T_R 
and T_M experiments, respectively. The full lines correspond to the total number 
of claims containing the source and target term in a given generation (also 
marked by the “t” character in the plot legend). The dashed lines represent the 
fraction of the claims that also contain an intermediate term (marked by the “r” 
character). 

The total number of relevant claims steadily decreases until approximately 
the 20th-25th generation and then starts to oscillate. For the relative number 
of solutions with intermediates, a similar trend can be seen after the 40th 
generation. This can be interpreted as an indication that the generations are 
structurally stabilised by then, at least for the evaluation data we work with. 

4.4.2. Properties of the Refined Graphs 

Before we proceed with discussing the results, let us have a look at the 
characteristics of the knowledge graphs corresponding to the generations we 
selected for evaluation. Tables 6 and 7 present the same type of data as the 
tables in Section 4.3.2. The extra rows with the Δ prefix show the relative 
difference between the refined and initial graphs (the refined value as a 
fraction of the initial one). The columns represent exactly the same measures as 
in the tables in Section 4.3.2: the number of vertices, number of edges, graph 
density, number of connected components, and the maximum, average and median 
component size in nodes (|V_G|, |E_G|, dn_G, |C_G|, |C_G^max|, |C_G^avg|, 
|C_G^med|), and the graph radius, diameter, transitivity and average shortest 
path lengths in terms of edges and the distance labeling (rd_G, dm_G, tr_G and 
the two asp_G variants). Figure 11 contains plots of the degree distribution in 
the refined graphs. 

Figure 10: Claim numbers through generations 

Graph    |V_G|    |E_G|    dn_G       |C_G|   |C_G^max|   |C_G^avg|   |C_G^med| 
T_R^40   10,879   15,940   2.694e-4   81      10,670      134.309     2 
ΔT_R     0.651    [...]    0.208      1.013   0.647       0.643       1 
T_M^39   37,782   65,263   9.144e-5   129     37,431      292.884     2 
ΔT_M     0.717    0.103    0.2        1.173   0.715       0.612       1 

Table 6: Basic characteristics of the evolved experimental graphs 

Graph    rd_G    dm_G     tr_G    asp_G (edges)   asp_G (distance) 
T_R^40   8.847   14.743   0.015   6.722           12.309 
ΔT_R     1.491   1.657    0.038   1.662           1.77 
T_M^39   7.936   13.885   0.014   6.349           13.721 
ΔT_M     1.329   1.551    0.054   1.602           1.781 

Table 7: Component-wise characteristics of the evolved experimental graphs 

Figure 11: Degree distributions for the evolved experimental graphs (degree rank plots) 

The refined graphs still contain about 65% and 72% of the original vertices 
for the T_R and T_M experiments, respectively; however, the edges are pruned 
much more heavily, to about 9% and 10%, respectively. The graph density is thus 
lower, too (at about 20% of the original value). The numbers of connected 
components do not change much. This is only to be expected given the nature of 
the population preparation and the tendency of the evolution process to preserve 
connectedness. The sizes of the components are more or less proportional to the 
reduction of the vertex number. 
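
The quoted density figure follows from the vertex and edge ratios, which a short 
worked check can make explicit. The sketch below assumes the usual density of a 
simple graph, proportional to |E| / (|V| (|V| - 1)), so that for large graphs 
the density ratio is approximately the edge ratio divided by the squared vertex 
ratio. The function names are hypothetical; the input numbers are the Δ values 
from Table 6. 

```python
def density_ratio(vertex_ratio, edge_ratio):
    """Approximate refined/initial density ratio for large simple graphs.

    Density is proportional to |E| / (|V| * (|V| - 1)), so for large |V| the
    ratio of densities is close to edge_ratio / vertex_ratio ** 2.
    """
    return edge_ratio / vertex_ratio ** 2


def implied_edge_ratio(vertex_ratio, dens_ratio):
    """Invert the approximation: edge ratio implied by the other two ratios."""
    return dens_ratio * vertex_ratio ** 2


# T_M: vertex ratio 0.717 and edge ratio 0.103 give a density ratio near 0.2.
t_m = density_ratio(vertex_ratio=0.717, edge_ratio=0.103)
# T_R: vertex ratio 0.651 and density ratio 0.208 imply an edge ratio near 9%.
t_r = implied_edge_ratio(vertex_ratio=0.651, dens_ratio=0.208)

print(round(t_m, 3))   # 0.2
print(round(t_r, 3))   # 0.088
```

Both results agree with the figures given in the text: a density of about 20% of 
the original value, and edges pruned to about 9% in the T_R experiment. 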

What is more interesting are the component-wise characteristics of the 
refined graphs summarised in Table 7. The radius, diameter and average shortest 
path lengths all increase by up to 78% and no less than 32%, despite the 
graphs getting smaller. The clustering coefficient decreases quite radically, to 
about 3.8% and 5.4% of the original value for the T_R and T_M experiments, 
respectively. The vertex degree distribution still approximately follows a power 
law; however, the curve is not as steep as for the original graphs. These 
combined characteristics indicate that the refined graphs exhibit the small world 
property to a much lower extent than the original ones. This means that they are 
structurally more evenly organised and tend to have fewer vertices or vertex 
groups that connect large portions of the graph through very few edges. A 
possible consequence of this fact is lower redundancy and a higher rate of 
non-obvious connections in the refined graphs. Indeed, the analysis of the data 
w.r.t. the standard literature-based discovery application scenarios confirms 
this, as we show in the next section. 
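
The two quantities behind the small world argument, transitivity and average 
shortest path length, can be computed on a toy graph as in the following 
sketch. It is a plain standard-library illustration on an unweighted, 
undirected graph stored as an adjacency dictionary; the per-component averaging 
and the distance labeling used in the actual experiments are not reproduced, 
and all names and data are hypothetical. 

```python
from collections import deque
from itertools import combinations


def transitivity(adj):
    """Fraction of connected triples that are closed into triangles."""
    closed = 0    # closed triples (every triangle is counted once per corner)
    triples = 0   # all connected triples, counted by their centre vertex
    for nbrs in adj.values():
        k = len(nbrs)
        triples += k * (k - 1) // 2
        closed += sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return closed / triples if triples else 0.0


def average_shortest_path(adj):
    """Mean BFS distance over ordered pairs of distinct, reachable vertices."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# A triangle a-b-c with a tail c-d, and the same graph with edge a-b pruned.
dense = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
pruned = {"a": {"c"}, "b": {"c"}, "c": {"a", "b", "d"}, "d": {"c"}}

print(transitivity(dense), transitivity(pruned))
print(average_shortest_path(dense), average_shortest_path(pruned))
```

Pruning the single edge removes the only triangle and lengthens the average 
path, which is the same qualitative pattern as reported for the refined graphs. 
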
4.4.3. Performance of the Refined Graphs 

In this section, we first discuss the performance of our experiments w.r.t. the 
quantitative measures used by related approaches. This is then followed by 
qualitative analysis of the knowledge graphs we generated. 

Table 8 lists the values of the evd measure for the T_R data set. Our approach 
(the N column) is compared to the works [4], [42], [49], [11], [16] in columns 
C, S, W, G, H, respectively. For our approach, we list both the evd and evd_r 
values, while for the others, only evd is present as they do not consider evd_r. 
We also provide |G_C|, i.e., the number of solution graphs with intermediates. 
The evd numbers correspond to the best rank of the result that contains the 
given intermediate term. The “-” character means the intermediate cannot be 
found in any result for that approach. If there is “Y”, then the intermediate 
can be found in the results but no ranking is provided. Finally, the results 
with “*” in the C column indicate that the intermediate can only be found 
indirectly by manually exploring a broader context of the result [4]. 

Intermediate           N: evd   N: evd_r   N: |G_C|   C [4]   S [42]   W [49]   G [11]   H [16] 
Blood Viscosity        5        0.98       1          15*     2        Y        5        8 
Platelet Aggregation   20       0.844      18         1       1        Y        6        17 
Vascular Reactivity    73       0.615      4          -       1        Y        19       - 

Table 8: Evidence-based evaluation - T_R^40 data 
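
The notion of the best rank of a result containing a given intermediate can be 
sketched in a few lines. The sketch assumes that results are available as a 
list ordered from best to worst, each represented by the set of terms it 
contains; this representation, the function name and the toy data are 
assumptions made for illustration. 

```python
def best_rank(ranked_results, intermediate):
    """Return the 1-based rank of the first result containing the term.

    ranked_results is ordered from the best result to the worst; None is
    returned when the intermediate does not occur in any result.
    """
    for rank, terms in enumerate(ranked_results, start=1):
        if intermediate in terms:
            return rank
    return None


# Toy ranked results (placeholders, not experimental data).
results = [
    {"source", "x", "target"},
    {"source", "y", "target"},
    {"source", "x", "y", "target"},
]

print(best_rank(results, "y"))   # 2
print(best_rank(results, "z"))   # None
```

A missing intermediate yields None here, which corresponds to the “-” entries 
in the table. 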

Our approach finds all intermediate terms, which makes its performance 
equivalent to or better than the related approaches in this respect. Blood 
viscosity and platelet aggregation are placed among the top 16% of the results 
(out of 205 in the T_R experiment), while vascular reactivity is considered to 
be a relatively less important intermediate. 

Table 9 lists the same type of results as Table 8, only for the T_M experiment 
and a slightly different set of related works. Note that the related works are 
sometimes inconsistent in the exact wording of the intermediate terms, therefore 
we only focused on the nine out of eleven for which we were able to clearly 
reconcile the different alternatives of the term. 

The quantitative results of our approach are sparser than in the case of the 
T_R experiment. This has been caused mainly by the minimalistic, domain-agnostic 
approach we chose, which resulted in relatively low coverage of the intermediate 
synonyms appearing in the data (the fulltext mapping could only discover 
terms rather similar to the canonical intermediate form used as a query, while 
many synonyms are quite dissimilar strings). All related approaches but one [11] 
use term expansion and mapping using biomedical vocabularies like MeSH, and 
some even use quite extensive manual interventions (see Section 5.4 for details). 
Despite these limitations, we re-discovered five out of nine intermediates. Out 
of these, only three were discovered using a mature-enough generation of the 
refined knowledge graph, though. 

For the intermediates we managed to find, we achieved results comparable 
to or better than the other approaches. For instance, three out of five related 
approaches were not able to re-discover the cortical depression intermediate, 
which is considered very important in [45]. 

Table 9: Evidence-based evaluation - T_M^39 data 

The overall results of the evidence-based evaluation are encouraging. In the 
T_R experiment, our approach performed better than SI US], worse than SHE] 
and equally to [49]. In the T_M experiment, we bettered SI SI HI] while SI El 
outperformed us.^8 In total, we did better than more than half of the related 
approaches in terms of the intermediate ranking. 

The rarity and interestingness measures for the two experiments are given in 
Table 10. We can only compare ourselves to [4] as the measures were defined and 
used there for the first time. The average results of our approach are lower than 
in [4] for the T_R experiment. However, the median rarity and interestingness of 
the paths generated in our experiment is 0 and 1, respectively - only about one 
third of the T_R path associations have non-zero frequency on PubMed. This 
means that two thirds of the claims generated by our approach have the same 
performance in terms of rarity and interestingness as in [4]. The average results 
of the T_M experiment are better in our case. More than 98% of the T_M claims 
have zero rarity, which clearly outperforms [4]. 

Experiment   N: rar(G)   N: int(G)   C: rar(G)   C: int(G) 
T_R^40       6.722       0.13        0           1 
T_M^39       0.367       0.732       0.56        0.64 

Table 10: Claim frequency-based evaluation 

8 Note that direct comparison of the ranking results is conceptually difficult 
since the approaches generate rather varied forms of results, e.g., mere terms 
in EEl or oriented multigraphs in [4]. However, we can at least give this basic 
summary, which we corroborate by analysing the actual contents of the results 
later on. We also further discuss the major comparative benefits of our approach 
in Section 5.4. 

Before we continue with the qualitative analysis of the results, let us get back 
for a while to the structural properties of the refined knowledge graphs. Table 11 
gives the average relative ranking of the vertices corresponding to the source, 
target and intermediate terms in the initially extracted and refined graphs. The 
rankings are based on the vertex degree and betweenness centrality measures 
(from highest to lowest). These measures are typically used as an approximation 
of vertex importance within a graph [39]. The importance of sources, targets and 
intermediates in terms of degree is increased by the refinement in both 
experiments. The increase is largest for source and target terms in the T_M 
experiment and for the T_R intermediates. The importance in terms of betweenness 
centrality increases relatively less, with the T_M intermediates actually 
becoming slightly less important. These observations are consistent with the 
evidence-based evaluation in the sense of “sensitivity” of the experimental data 
sets towards the source, target and intermediate terms. The refinement of the 
T_R graph clearly raises the importance of all vertices, especially the 
intermediates. Indeed, all the terms are present in relatively highly ranking 
claims of the resulting T_R^40 graph. For the T_M data set, where only the 
importance of the source and target vertices is markedly rising, the results are 
much sparser - although the T_M^39 graph contains many claims connecting the 
source and target, there are relatively few intermediates from [45] present in 
these claims. 

[...] 0.522 [...] 0.708 0.537 0.592 0.656 0.736 
Intermediates 0.563 0.729 0.557 0.57 0.678 0.699 0.584 0.575 

Table 11: Degree-based ranking of the re-discovery terms 

The qualitative analysis of the solution contents further elaborates on the 
above observations about the initial and resulting graph structure. As specified 
in Section 4.3.4, the analysis is based on the topics covered by the solution 
graphs. These are terms that provide additional context for the intermediates 
in the solutions. We provide comprehensive lists of the unique context topics in 
Appendix A, together with references to supporting literature. 

The contents of Appendix A are summarised in Table 12, which contains 
the top_d, top_r, top_n score values for the initial and refined knowledge graphs 
in both experiments.^9 The table shows that our approach improves the quality 
of the extracted knowledge graphs. The topical density top_d (i.e., the 
ratio of unique topics among the paths connecting source and target terms) 
increases by about 27% and 122% for the T_R and T_M experiments, respectively. 
The relevance top_r increases by about 31% and 46% for T_R and T_M, respectively. 
Finally, the relative topical novelty top_n increases by about 86% for the T_M 
experiment. In case of the T_R experiment, the measure is slightly lower for 
T_R^40 than for T_R^0; however, there is only one non-novel solution in both 
knowledge graphs. The decrease in the relative top_n value is caused by the 
lower total number of solutions in the refined graph. 

9 As we are not experts in the domains involved, we adopted a very conservative 
strategy for determining the topic relevance. If we could not directly verify 
any particular relationship between biomedical concepts present in the solution 
graphs using a review of published literature via PubMed, we deemed the 
corresponding solution irrelevant. We encourage more knowledgeable readers to 
suggest possible updates of the detailed tables in Appendix A. 

Score    T_R^0    T_R^40   T_M^0    T_M^39 
top_d    [...]    [...]    [...]    0.818 
top_r    [...]    0.75     0.607    0.889 
top_n    0.938    0.889    0.471    0.875 

Table 12: Summary of the qualitative evaluation 
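
The percentage changes quoted for the scores can be re-derived from the values 
in Table 12. The short sketch below does this for the T_M relevance, the T_M 
novelty and the T_R novelty; the helper name is hypothetical, and the relative 
change is assumed to be measured against the initial graph. 

```python
def relative_change(initial, refined):
    """Change of the refined score relative to the initial one."""
    return (refined - initial) / initial


# (initial, refined) score pairs as listed in Table 12.
top_r_m = relative_change(0.607, 0.889)   # relevance, T_M experiment
top_n_m = relative_change(0.471, 0.875)   # novelty, T_M experiment
top_n_r = relative_change(0.938, 0.889)   # novelty, T_R experiment

print(round(100 * top_r_m))   # 46
print(round(100 * top_n_m))   # 86
print(round(100 * top_n_r))   # -5
```

The first two values match the 46% and 86% increases stated above, and the last 
one shows the slight decrease of the novelty score in the T_R experiment. 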

These results confirm our assumption that the refinement improves the 
quality of statements extracted from literature, at least in the context of two 
standard literature-based discovery scenarios. The improvement in quality is 
threefold. Firstly, the refined knowledge graphs are less redundant (the topical 
density is higher). Secondly, there are markedly more relevant solutions in the 
results. And thirdly, the refined solutions are largely non-obvious (high top_n 
measure). 

Direct and exact comparison of our qualitative results to related state of 
the art is unfortunately impossible due to the afore-mentioned differences in the 
solution representations. However, we can at least discuss the commonalities 
and differences informally. Figure [12] displays the hierarchy of topics covered 
by the Tr results. Each vertex in the hierarchy graph represents a part of the 
topic. The roots of the presented hierarchies are the intermediate concepts. 
The vertices shared across multiple topics have normal outlines. Vertices that 
complete the topics on the way from the root have bold outlines. 

The impact of glyceryl trinitrate on vasodilation and consequently also on 
blood flow has been studied in the context of possible treatment of Raynaud’s 
syndrome P2j. Our method reflects these findings in constructing a 
corresponding connection between Raynaud’s syndrome and platelet aggregation 
which is quite closely related to blood flow tm. Phosphatidylcholine, also a 
relatively common vertex in the generated claims, refers to a class of 
phospholipids that is closely related to metabolism of fatty acids, including 
those found in fish oils [Ti. The vertices connected to phosphatidylcholine 
mediate the relationship between fish oil and platelet aggregation in the 
solutions. The topic with ADP-induced platelet aggregation [35] specifies the 
type of platelet aggregation fish oils can influence. The solutions concerned 
with anti-thrombotic effect put this vertex in connection with fish oils, 
possibly with an intermediate vertex referring to myocardial infarction. This 
corresponds to the anti-thrombotic effect of fish oils demonstrated for instance 
in [52]. Finally, one of our solutions identified a link between platelet 
aggregation and fish oils via their influence on levels of plasma-beta 
thromboglobulin, a marker in ischemic heart disease [13]. 

The solution involving the vascular reactivity intermediate puts it in the 
context of the influence of fish oils on lower vascular resistance, as discussed 
for instance in [2H|. The effect of local cooling on the digital systolic 
pressure is inherent to Raynaud’s syndrome but its connection to the vascular 
reactivity discovered by our approach is more indirect. One branch of the 
solution involving the blood viscosity intermediate revisits the relationship 
between fish oils and ischemic heart disease observed in one of the claims 
containing platelet aggregation. The other branch is new, though, and puts the 
blood viscosity in relation with high levels of fibrinogen in Raynaud’s syndrome 
patients |4fij. 

Figure 12: Hierarchy of relevant topics in T_R^40 

When comparing the contents of the Tr solutions with related state of the 
art approaches, we can only refer to [4] and [44] as the other works generate 
mere lists of possible intermediates without further context. Many contexts 
associated with the intermediates as possible explanations of the connections 
are missing in the related works. Examples are blood flow, glyceryl trinitrate, 
ADP-induced platelet aggregation, phosphatidylcholine or plasma-beta throm- 
boglobulin within ischemic heart disease. However, most of these connections 
are rather explanatory and not essential in the scope of Raynaud’s syndrome 
despite being valid. In [4], many of the graphs involve epoprostenol (essentially 
a prostaglandin) as a mediator of the influence of fish oils on platelet 
aggregation. This is consistent with [33], which establishes the connection 
between fish oil and platelet aggregation as a result of increased levels of 
prostaglandins. This 
context is missing in our results that involve the intermediates, however, it is 
present twice among the top-ten solutions (at ranks 4 and 8). Once it appears 
in relation to the action of the drug indomethacin, and then also in relation 
to luteolytic activity in women with Raynaud disease. These are potentially 
interesting findings that extend the results produced by comparable state of the 
art approaches. 

Figure 13 displays the hierarchy of topics covered by the T_M results. One 
solution involving the epilepsy intermediate puts it in the context of magnesium 
being used as a mechanism for management of reverberating brain waves m- 
These are associated with epilepsy and vertigo attacks and the solution 
suggests that treatments for these conditions may be used for migraine as well. 
Other claims related to epilepsy all share multifocal EEG abnormalities which 
are characteristic for epilepsy [2j5f. Two different types of claims 
complemented these findings - two solutions dealing with magnesium 
concentrations in cerebrospinal fluid in relation to migraine mi and one 
solution related to transmitter release and nerve stimulation. The solutions 
involving the cortical spreading depression intermediate were all related to 
similar concepts as the epilepsy ones. This is not surprising, since cortical 
spreading depression is quite closely related to seizures [10]. 

Similarly to the T_R experiment, we can only compare the contents of our 
T_M solutions to [4] and [45], which are the only works that provide context in 
addition to the intermediates. The graphs presented in [4] for migraine and 
magnesium are generally much sparser than those for Raynaud and fish oil 
(typically containing only the source, target and intermediate node). Moreover, 
none of the results discussed in the article in detail concern epilepsy or 
spreading depression. The work [45] confirms a close relationship between 
epilepsy and cortical spreading depression, which is consistent with a 
straightforward interpretation of our results. Our solutions also managed to 
bring up the relationship between magnesium and cerebrospinal fluid in the 
context of epileptic attacks. In addition to that, our results appear to 
strongly associate migraine with multifocal EEG abnormalities. This is 
consistent with the relationship between the abnormalities and headaches 
demonstrated for instance in m- Other potentially interesting findings not 
covered by related works are vertigo, reverberation and the relationship between 
migraine, cortical spreading depression and rolandic epilepsy [50]. 

Figure 13: Hierarchy of relevant topics in T_M^39 

5. Related Work 

We split this section into four thematic blocks that correspond to the main 
theoretical and application-specific facets of our work. In particular, we review 
the areas of: 1. automated discovery, 2. ontology learning, 3. discovery supported 
by knowledge graphs, 4. literature-based discovery. 

5.1. Automated Discovery 

Research into how discoveries can be automated or facilitated by machines 
dates back to the dawn of the digital computer era. The work [27] 
provides a comprehensive analysis of the discovery process operationalised as 
creative problem solving. It reviews several classic machine discovery systems 
and the heuristics used by them, and also mentions several properties of worthy 
discoveries like novelty and value. A more recent related work [18] reviews the 
major approaches to studying the process of scientific discovery, provides an¬ 
other survey of automated discovery systems and analyses additional features 
of relevant discoveries, such as surprise. The works ED ED] review still more 
machine discovery systems and heuristics, and identify features like refutability 
and simplicity as essential to discoveries. One of the most recent and relevant 
works from this area is [23]. It builds on Eli m] and introduces formalisations 
of several discovery features. In particular, it models novelty and value using 
metric spaces, and surprise using Bayesian probabilities. 

Discovery features discussed in the referenced works conform to our virtue 
definitions, although most of them do not provide a systematic formalisation, 
only rather application-specific implementations. For instance, refutability and 
simplicity as reviewed in [21] directly correspond to our virtues. Surprise and 
novelty discussed in the other works can be modelled by putting emphasis on 
radical claims as addressed by the conservatism virtue, only using different 
distance metrics for each of the respective features. We believe that our 
approach presents a new way of formalising discovery features that is consistent 
with the related state of the art, but is more systematic, comprehensive and 
extensible. In 
addition, we provide an actionable set of measures implemented in the context 
of knowledge graphs. This enables universal applicability of our research, which 
is not the case in most of the rather specific afore-mentioned approaches. 




5.2. Ontology Learning 

In the last fifteen years, there has been a growing interest in exploring the 
potential of automatically extracted graph structures for knowledge discovery. 
Many such approaches can be clustered under the umbrella of ontology 
learning [22], which aims at extracting complex statements from unstructured textual 
resources. This is done using specifically tailored methods from AI disciplines 
like natural language processing or machine learning. 

As a recent survey m shows, the applicability of existing ontology learning 
approaches to (semi) automated knowledge discovery is still quite limited. Many 
of the techniques are dependent on manually curated resources. They also 
introduce a lot of assumptions during the extraction process (based on, for 
instance, linguistic facts valid only in the context of a particular language or 
discourse). This limits their universal applicability. Another problem is that 
the more complex knowledge representation the learned ontologies use, the more 
restrictive they are about their meaning. This typically leads to brittleness 
w.r.t. the often inherently vague and contextual nature of the knowledge they 
represent. This can easily cause problems in machine-aided knowledge discovery 
scenarios where we typically want to represent the knowledge implied by the 
input data in as unbiased a way as possible. Another practical limitation is that 
most ontology learning systems do not scale very well, as reported in m- 

5.3. Discovery Supported by Knowledge Graphs 

More recent works related to machine discovery using knowledge graphs 
include 00 [28], which also contain comprehensive reviews of prior similar 
approaches. The approach elaborated in |Bj presents methods for knowledge 
discovery in RDF [23] data based on user-defined query patterns and analytical 
perspectives. Our approach complements 0 by offering means for automated 
analysis and refinement of knowledge graphs using application-independent, 
well-founded features. 

Google’s Knowledge Vault [8] presents a web-scale approach to probabilistic 
knowledge fusion that uses graphs represented in the RDF format. It tackles 
the scalability vs. accuracy trade-off of the manual and automatic approaches 
to construction of knowledge graphs. This is done by refining statements ex¬ 
tracted from the web content using models learned from pre-existing highly ac¬ 
curate knowledge bases like YAGO or Freebase. Additional details and broader 
theoretical context of the approach introduced in [8] is given in [28], which of¬ 
fers a comprehensive review of relational machine learning approaches in the 
context of RDF-compatible knowledge graphs. The main advantage of our 
approach w.r.t. the works [8], [28] is that we are not critically dependent on the 
background knowledge model. In addition, we present a complementary well- 
founded approach to determining which relationships in automatically extracted 
knowledge graphs are worth preservation. Having said that, the techniques re¬ 
viewed in [28] can certainly provide valuable hints on future extensions of our 
approach to graphs with oriented edges representing more than one type of 
relationships (i.e., RDF graphs). 
5.4. Literature-Based Discovery 

As our approach has been validated by experiments in literature-based dis¬ 
covery, we need to position ourselves within that field as well. Surveys of 
related works (focusing mostly on the domain of life sciences) are provided 
in 01311 HU- The specific approaches we compare ourselves to are described 
in m la m na sa nu. In most cases where we were able to directly compare 
our results with the related works, our approach was at least as good as and 
often better than the state of the art. In addition, we managed to hint at 
several relevant insights that were not even discussed by the human expert in the 
original studies [44], [45]. 

The most significant advantages of our approach are, however, as follows. 

1. It is fully automatic. The only manual action we performed was pruning 
the fulltext search results when mapping terms to the corresponding vertices; 
however, this is only required for the evaluation, not for the method itself. 

2. There are no domain-specific dependencies and thus our work is readily 
applicable to any field, not just biomedical literature-based discovery. 

3. We produce extensive contextual information that can facilitate the 
interpretation of the results and thus make the machine-aided discovery process 
more efficient. 

4. Our approach is based on theoretical foundations motivated by the state of 
the art philosophical study of key features of scientific discoveries. 

The works [42], [16], [49] all depend on rather extensive manual effort 
(definition of semantic types and discovery patterns, result pruning, etc.). The 
approaches [4], [11] are automated; however, only [4] provides broader context 
in order to elucidate the connections. Moreover, all works but [11] substantially 
rely on an external domain-specific source of background knowledge and/or 
domain-specific NLP tools, namely the MeSH and UMLS vocabularies [3] and 
the tools SemRep [38] and BioMedLEE [5]. It is quite plausible to assume that 
without these resources, the related approaches dependent on them would 
perform much less favourably when compared to our implementation. Last but 
not least, all the related approaches lack the universally applicable theoretical 
foundations presented as the core contribution of this article. 

6. Conclusions and Future Work 

We have presented a novel approach to discovery informatics that is based 
on formalisation of hypothesis virtues in the context of knowledge graphs. We 
have shown that the approach is naturally motivated, well-founded, extensible 
and universally applicable. It can be used as a broader theoretical frame for 
other approaches to machine discovery, as briefly outlined in Section 5. We 
have delivered an implementation of the presented research and performed its 
experimental validation using standard scenarios in literature-based discovery. 
A successful comparison with related state of the art tools demonstrates the 
practical relevance of our work. 

In the near future, we will extend the theoretical framework in order to address 
directed multi-graphs with predicate edge labels and more complex semantics 
associated with particular edge and vertex types. This will allow for 
straightforward application of our approach to more expressive knowledge graphs, 
such as RDF [24] knowledge bases and ontologies in the Linked Open Data cloud [14]. 
Furthermore, we intend to continue demonstrating the universality of our 
framework by using it in other experimental scenarios targeted by related works in 
machine discovery. We also plan to explore the complex relationships between 
specific measures and their influence on the properties of the evolutionary 
refinement process (e.g., convergence, optimality and completeness bounds). This 
will lead to deeper understanding of the refinement, and therefore also to more 
efficient implementations. Finally, and perhaps most importantly, we would 
like to use our approach in scenarios involving actual new discoveries, in direct 
collaboration with corresponding domain experts. 
Appendix A: Context Topics for the Intermediate Terms 


This appendix presents tables with the context topics of the intermediate 
terms for the initial and refined graphs in the Tr and Tm experiments. Each table 
contains topics for one intermediate term within a specific experimental graph. The 
topics are given in the first column. The second column of each table states 
whether or not the corresponding topics are relevant. Topics whose relevant 
part is subsumed by another relevant topic are not considered relevant unless 
they extend the subsumed part with new relevant information. If there is an 
exclamation mark in the second column (applicable only to relevant topics), it 
means the relationship expressed by the corresponding solution is novel, i.e., not 
covered in any single existing article published by March 2015. The third 
column lists references to articles that jointly support the relevant claims. 
We use easily dereferenceable PubMed identifiers for brevity. Note that more 
common and/or simple relationships may have alternative sets of supporting 
articles that do not appear in the lists provided here. 
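
The table conventions described above can be made concrete with a minimal sketch. It is an illustration only, not part of the presented implementation: the `TopicRow` type, the `summarise` helper and the `PUBMED_BASE` URL prefix are assumptions introduced here, and the sample data are the two rows of Table 17.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumption: the current public PubMed URL scheme; not taken from the article.
PUBMED_BASE = "https://pubmed.ncbi.nlm.nih.gov/"


@dataclass
class TopicRow:
    """One row of a context-topic table (hypothetical helper type)."""
    topics: List[str]
    rel: str                                          # "Y", "Y!" or "N"
    support: List[int] = field(default_factory=list)  # PubMed identifiers

    @property
    def relevant(self) -> bool:
        # Both "Y" and "Y!" mark a relevant topic.
        return self.rel.startswith("Y")

    @property
    def novel(self) -> bool:
        # The exclamation mark flags a relationship not covered by a single article.
        return self.rel == "Y!"

    def support_urls(self) -> List[str]:
        # Turn each PubMed identifier into a dereferenceable URL.
        return [f"{PUBMED_BASE}{pmid}/" for pmid in self.support]


def summarise(rows: List[TopicRow]) -> Dict[str, int]:
    """Count all, relevant and novel topics in one table."""
    return {
        "topics": len(rows),
        "relevant": sum(r.relevant for r in rows),
        "novel": sum(r.novel for r in rows),
    }


# The two rows of Table 17 (vascular reactivity, refined graph, Tr experiment).
rows = [
    TopicRow(
        ["local cooling", "digital systolic pressure",
         "less responsive", "haemodynamic profile"],
        "Y",
        [22453196, 17876193],
    ),
    TopicRow(
        ["upper limb", "arteritis", "haemodynamic profile",
         "myocardial infarction"],
        "N",
    ),
]

print(summarise(rows))
print(rows[0].support_urls())
```

Keeping the relevance mark as the literal string from the table avoids any loss of information: relevance and novelty are both derived from it on demand.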

First, we provide three tables for the initial graph in the Tr experiment 
(Tables 13-15) and three tables for the refined graph (Tables 16-18). 

The topics for the Tm experiment are organised in five tables for the initial 
graph (Tables 19-23) and in three tables for the refined graph (Tables 24-26). 
| Topics | Rel. | Support |
|---|---|---|
| glyceryl trinitrate, blood flow, myocardial infarction | Y! | 17311994, 14831190, 10604966, 3092847, 9310278 |
| glyceryl trinitrate, blood flow, precursor c18, c20 | N | N/A |
| glyceryl trinitrate, blood flow, phosphatidylcholine, arachidonic acid | Y! | 17311994, 9675609, 4205973, 14831190, 10604966, 3092847 |
| glyceryl trinitrate, blood flow, aa and docosahexaenoic acid | N | N/A |
| glyceryl trinitrate, blood flow, phosphatidylcholine, individual phospholipid, alteration and recovery | N | N/A |
| glyceryl trinitrate, blood flow, antithrombin iii | Y! | 17311994, 18370504, 14831190, 10604966, 3092847 |
| glyceryl trinitrate, blood flow, adp-induced platelet aggregation | Y! | 17311994, 10086317, 18370504, 14831190, 10604966, 3092847 |
| glyceryl trinitrate, blood flow, adp-induced platelet aggregation, corn oil | N | N/A |
| glyceryl trinitrate, blood flow, linseed | N | N/A |
| glyceryl trinitrate, blood flow, thrombin and collagen | Y! | 17311994, 4468230, 18370504, 14831190, 10604966, 3092847 |
| glyceryl trinitrate, blood flow, collagen, precursor c18 | N | N/A |
| glyceryl trinitrate, blood flow, adp-induced platelet aggregation, vasospastic disease, prostaglandin e1 | Y! | 17311994, 10086317, 18370504, 14831190, 10604966, 3092847 |
| glyceryl trinitrate, blood flow, low pufa coconut oil adp-induced platelet aggregation | N | N/A |
| alpha-adrenergic receptor | N | N/A |
| prostacyclin, thromboxane, adp-induced platelet aggregation | Y! | 10086317, 6258879, 19037602 |
| upper limb, arteritis, haemodynamic profile | N | N/A |
| alpha-adrenergic receptor, adp-induced platelet aggregation | Y | 34707 |

Table 13: Topic contexts for platelet aggregation in the initial graph of the Tr experiment 


| Topics | Rel. | Support |
|---|---|---|
| glyceryl trinitrate, blood flow, hand, cold | N | N/A |
| glyceryl trinitrate, blood flow, hand, cold, myocardial infarction | Y! | 24753696, 9310278, 10086317, 18370504, 14831190 |
| vascular dilation, cold, myocardial infarction | Y! | 9310278, 15695304 |
| upper limb, arteritis, haemodynamic profile, myocardial infarction | N | N/A |
| local cooling, digital systolic pressure, haemodynamic profile, less responsive, myocardial infarction | Y! | 9310278, 22453196, 17876193 |
| cold spell, radiological investigation | N | N/A |

Table 14: Topic contexts for vascular reactivity in the initial graph of the Tr experiment 
| Topics | Rel. | Support |
|---|---|---|
| finger systolic pressure, dazoxiben treatment, high level of plasma fibrinogen | Y! | 6393521, 7459607 |
| fibrinolytic enhancement, high level of plasma fibrinogen | Y! | 698554, 7459607 |
| fibrinolytic enhancement, deformability index | Y! | 698554, 519354 |
| finger systolic pressure, dazoxiben treatment, high level of plasma fibrinogen, shear rate | Y! | 6393521, 7459607, 579511 |
| finger systolic pressure, dazoxiben treatment, high level of plasma fibrinogen, shear rate, fibrinolytic enhancement, deformability index | Y! | 6393521, 7459607, 579511, 519354 |

Table 15: Topic contexts for blood viscosity in the initial graph of the Tr experiment 


| Topics | Rel. | Support |
|---|---|---|
| n-3, phosphatidylcholine, blood flow, glyceryl trinitrate | Y! | 17311994, 9675609, 24679762, 14831190, 10604966, 3092847 |
| phosphatidylcholine, aa and docosahexaenoic acid, arachidonic acid, blood flow, glyceryl trinitrate | Y! | 17311994, 9675609, 4205973, 14831190, 10604966, 3092847 |
| plasma-beta-thromboglobulin, ischemic heart disease, platelet count, initial decrease, blood flow, glyceryl trinitrate | Y! | 17311994, 6123019, 14831190, 10604966, 3092847 |
| adp-induced platelet aggregation, blood flow, glyceryl trinitrate | Y! | 17311994, 10086317, 18370504, 14831190, 10604966, 3092847 |
| anti-thrombotic effect, blood flow, glyceryl trinitrate | Y! | 17311994, 6294902, 14831190, 10604966, 3092847 |
| anti-thrombotic effect, myocardial infarction, blood flow, glyceryl trinitrate | Y! | 17311994, 6294902, 9310278, 14831190, 10604966, 3092847 |
| n-3, phosphatidylcholine, blood flow, glyceryl trinitrate, alteration and recovery | N | N/A |
| n-3, phosphatidylcholine, blood flow, glyceryl trinitrate, alteration and recovery, individual phospholipid | N | N/A |
| precursor c18, phosphatidylcholine, blood flow, glyceryl trinitrate | Y! | 17311994, 9675609, 4205973, 14831190, 10604966, 3092847 |

Table 16: Topic contexts for platelet aggregation in the refined graph of the Tr experiment 


| Topics | Rel. | Support |
|---|---|---|
| local cooling, digital systolic pressure, less responsive, haemodynamic profile | Y | 22453196, 17876193 |
| upper limb, arteritis, haemodynamic profile, myocardial infarction | N | N/A |

Table 17: Topic contexts for vascular reactivity in the refined graph of the Tr experiment 


| Topics | Rel. | Support |
|---|---|---|
| plasma-beta-thromboglobulin, ischemic heart disease, high level of plasma fibrinogen, shear rate | Y! | 6123019, 579511, 519354 |

Table 18: Topic contexts for blood viscosity in the refined graph of the Tr experiment 
| Topics | Rel. | Support |
|---|---|---|
| acute intrapartum fetal distress | N | N/A |
| flunarizine, seizure | Y | 25754865, 3332609, 22406257 |
| toxemia, horton | Y! | 25754865, 25373431, 14401618, 6557975 |
| bulbo-cortical pathway | Y | 25754865, 3609865 |
| sea water | N | N/A |
| m.e.p.p., inhibitory effect of prostaglandin on vasopressin | N | N/A |
| brachymetapody, inhibitory effect of prostaglandin on vasopressin, cyanosis | N | N/A |
| fibrinolysis, lipid, vascular disease | Y! | 25754865, 20669129, 25737193 |
| antiserotonin, toxemia | Y! | 25754865, 25373431, 5315778 |
| spasmophilia, magnesium sulfate | Y | 25754865, 9340190 |
| cyanosis, limb, inhibitory effect of prostaglandin on vasopressin, magnesium excretion | N | N/A |
| conventional therapy for vertigo | Y | 25754865, 23837033 |
| ammonium | Y | 25754865, 10897167 |
| verapamil, magnesium blockade | Y! | 25754865, 23973639, 24113539, 8891316 |
| increase in bathing, washing | Y! | 25754865, 22940200, 25667882 |
| ketonic body | Y | 25754865, 24300035 |
| merskey, cerebrospinal fluid | Y | 25754865, 6100318 |

Table 19: Topic contexts for epilepsy in the initial graph of the Tm experiment 


| Topics | Rel. | Support |
|---|---|---|
| alkaline phosphatase, hydroxyproline | N | N/A |
| magnesium chloride | Y! | 25238714, 25010639, 24828386, |
| immunoreactive, hyposmotic stress | N | N/A |

Table 20: Topic contexts for prostaglandin in the initial graph of the Tm experiment 

| Topics | Rel. | Support |
|---|---|---|
| magnesium blockade, benign syndrome | N | N/A |

Table 21: Topic contexts for cortical spreading depression in the initial graph of the Tm experiment 

| Topics | Rel. | Support |
|---|---|---|
| indomethacin | Y | 22529203, 2925371 |
| high calcium | Y! | 12010379, 2925371, 15152357 |

Table 22: Topic contexts for vascular reactivity in the initial graph of the Tm experiment 


| Topics | Rel. | Support |
|---|---|---|
| fatty acid level, aspirin, dipyridamole, epilepsy | Y! | 25741817, 25116182, 10775263, 22749692 |
| cholesterol | N | N/A |
| endocarditis, vasculomotor reaction, epilepsy | Y! | 3513926, 21762000 |
| prostaglandin synthesis, potent anti-inflammatory agent | N | N/A |

Table 23: Topic contexts for platelet aggregation in the initial graph of the Tm experiment 
| Topics | Rel. | Support |
|---|---|---|
| transmitter release during repetitive nerve activity, nerve stimulation, multifocal EEG abnormality | Y! | 25754865, 2446959, 13175198 |
| multifocal EEG abnormality, magnesium sulfate | Y! | 25754865, 2446959, 23256267 |
| vertigo, convulsion treatment, reverberation | Y! | 25754865, 23837033 |
| multifocal EEG abnormality, merskey, cerebrospinal fluid | Y! | 25754865, 2446959, 6100318 |
| multifocal EEG abnormality, plasma, increase in magnesium concentration in csf | Y! | 25754865, 2446959, 3981211 |

Table 24: Topic contexts for epilepsy in the refined graph of the Tm experiment 


| Topics | Rel. | Support |
|---|---|---|
| childhood epilepsy with rolandic spike, hemisphere | Y! | 25754865, 19271946, 19674062, 22961355 |
| hemiplegia, childhood | Y | 25754865, 23907418, 21490217 |
| partial occipital epilepsy, reverberation, magnesium blockade | Y! | 25754865, 23907418, 1283483, 1110377 |

Table 25: Topic contexts for cortical spreading depression in the refined graph of the Tm experiment 


| Topics | Rel. | Support |
|---|---|---|
| high calcium, tension headache | N | N/A |

Table 26: Topic contexts for vascular reactivity in the refined graph of the Tm experiment 